FROM ε-ENTROPY TO KL-ENTROPY: ANALYSIS OF MINIMUM
INFORMATION COMPLEXITY DENSITY ESTIMATION

By Tong Zhang 
Yahoo Research 

We consider an extension of ε-entropy to a KL-divergence based
complexity measure for randomized density estimation methods. Based
on this extension, we develop a general information-theoretical
inequality that measures the statistical complexity of some
deterministic and randomized density estimators. Consequences of the
new inequality will be presented. In particular, we show that this
technique can lead to improvements of some classical results
concerning the convergence of minimum description length and Bayesian
posterior distributions. Moreover, we are able to derive clean
finite-sample convergence bounds that are not obtainable using
previous approaches.

1. Introduction. The purpose of this paper is to study a class of
complexity minimization based density estimation methods using a
generalization of ε-entropy, which has become a central technical tool
in the traditional finite-sample convergence analysis. Specifically,
we derive a simple yet general information-theoretical inequality that
can be used to measure the convergence behavior of some randomized
estimation methods. We then explore various consequences of this very
basic inequality.

We shall first introduce basic notation used in the paper. Consider a
sample space $\mathcal{X}$ and a measure $\mu$ on $\mathcal{X}$ (with respect to some
$\sigma$-field). In statistical inference, nature picks a probability measure
$Q$ on $\mathcal{X}$ which is unknown. We assume that $Q$ has a density $q$ with
respect to $\mu$. In density estimation, we consider a set of probability
densities $p(\cdot|\theta)$ (with respect to $\mu$ on $\mathcal{X}$) indexed by
$\theta \in \Gamma$. Without causing any confusion, we may also occasionally denote
the model family $\{p(\cdot|\theta) : \theta \in \Gamma\}$ by the same symbol $\Gamma$.
Throughout this paper, we always denote the true underlying density
by $q$, and we do not assume that $q$ belongs to the model class $\Gamma$. Given
$\Gamma$, our goal is to select a density $p(\cdot|\hat\theta) \in \Gamma$ based on the
observed data $X = \{X_1, \ldots, X_n\} \in \mathcal{X}^n$, such that $p(\cdot|\hat\theta)$ is
as close to $q$ as possible when measured by a certain distance function
(which we shall specify later).

In the framework considered in this paper, we assume that there is a
prior distribution $d\pi(\theta)$ on the parameter space $\Gamma$ that is
independent of the observed data. For notational simplicity, we shall
call any probability density $\hat w_X(\theta)$ on $\Gamma$ that depends on the
observation $X$ (measurable on $\mathcal{X}^n \times \Gamma$), taken with respect to
$d\pi(\theta)$, a posterior randomization measure, or simply a posterior.
In particular, a posterior randomization measure in our sense is not
limited to a Bayesian posterior distribution, which has a very specific
meaning. We are interested in the density estimation performance of
randomized estimators that draw $\theta$ according to posterior
randomization measures $\hat w_X(\theta)$ obtained from a class of density
estimation schemes. We should note that in this framework, our density
estimator is completely characterized by the associated posterior
$\hat w_X(\theta)$.

The paper is organized as follows. In Section 2, we introduce a
generalization of ε-entropy for randomized estimation methods, which we
call KL-entropy. Then a fundamental information-theoretical inequality,
which forms the basis of our approach, will be obtained. Section 3
introduces the general information complexity minimization (ICM)
density estimation formulation, where we derive various finite-sample
convergence bounds using the fundamental information-theoretical
inequality established earlier. Sections 4 and 5 apply the analysis to
the case of minimum description length (MDL) estimators and to the
convergence of Bayesian posterior distributions. In particular, we are
able to simplify and improve most results in [1] as well as various
recent analyses of the consistency and concentration of Bayesian
posterior distributions. Some concluding remarks will be presented in
Section 6.

Throughout this paper, we ignore the measurability issue, and assume 
that all quantities appearing in the derivations are measurable. Similarly to 
empirical process theory [14], the analysis can also be written in the language 
of outer-expectations, so that the measurability requirement imposed in this 
paper can be relaxed. 

2. The basic information-theoretical inequality. In this section we
introduce an information-theoretical complexity measure of randomized
estimators represented as posterior randomization measures. As we shall
see, this quantity directly generalizes the concept of ε-entropy for
deterministic estimators. We also develop a simple yet very general
information-theoretical inequality, which bounds the convergence
behavior of an arbitrary randomized estimator using the introduced
complexity measure. This inequality is the foundation of the approach
introduced in this paper.






Definition 2.1. Consider a probability density $w(\cdot)$ on $\Gamma$ with respect
to $\pi$. The KL-divergence $D_{KL}(w\,d\pi \| d\pi)$ is defined as

$$D_{KL}(w\,d\pi \| d\pi) = \int_\Gamma w(\theta) \ln w(\theta)\, d\pi(\theta).$$

For any posterior randomization measure $\hat w_X$, we define its KL-entropy
with respect to $\pi$ as $D_{KL}(\hat w_X\,d\pi \| d\pi)$.

Note that $D_{KL}(w\,d\pi \| d\pi)$ may not always be finite. However, it is
always nonnegative.

KL-divergence is a rather standard information-theoretical concept. In
this section we show that it can be used to measure the complexity of a
randomized estimator. We can immediately see that the quantity directly
generalizes the concept of ε-entropy on an ε-net: assuming that we have
$N$ points in an ε-net, we may consider a prior that puts a mass of $1/N$
on every point. It is easy to see that any deterministic estimator in
the ε-net can be regarded as a randomized estimator that is concentrated
on one of the $N$ points with posterior weight $N$ (and weight of zero
elsewhere). Clearly this estimator has a KL-entropy of $\ln N$, which is
essentially the ε-entropy. In fact, it is also easy to verify that any
randomized estimator on the ε-net has a KL-entropy bounded by its
ε-entropy $\ln N$. Therefore ε-entropy is the worst-case KL-entropy on an
ε-net with a uniform prior.
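The ε-net computation above can be checked numerically. The following is a minimal sketch for a finite net with a uniform prior; the function name `kl_entropy` and the choice $N = 8$ are illustrative assumptions, not part of the paper.

```python
import math

def kl_entropy(weights, prior):
    """KL-entropy D_KL(w dpi || dpi) = sum_j prior[j] * w[j] * ln w[j] for a discrete prior.

    `weights` is the posterior density with respect to the prior, so
    sum_j prior[j] * weights[j] must equal 1.
    """
    total = sum(p * w for p, w in zip(prior, weights))
    assert abs(total - 1.0) < 1e-9, "weights must be a density w.r.t. the prior"
    # The convention 0 * ln 0 = 0 handles points with zero posterior weight.
    return sum(p * w * math.log(w) for p, w in zip(prior, weights) if w > 0)

N = 8  # illustrative net size
uniform_prior = [1.0 / N] * N

# Deterministic estimator: all posterior mass on one point, i.e. density N there.
point_mass = [0.0] * N
point_mass[3] = float(N)

# A randomized estimator spreading its posterior mass evenly over two points.
two_point = [0.0] * N
two_point[0] = two_point[1] = N / 2.0

print(kl_entropy(point_mass, uniform_prior), math.log(N))
print(kl_entropy(two_point, uniform_prior), math.log(N / 2))
```

In this sketch the point mass attains $\ln N$, while the two-point posterior has the smaller KL-entropy $\ln(N/2)$, in line with the worst-case statement.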

The concept of ε-entropy can be regarded as a notion to measure the
complexity of an explicit discretization, usually for a deterministic
estimator on a discrete ε-net. The concept of KL-entropy can be regarded
as a notion to measure the complexity of a randomized estimation method,
where the discretization is done implicitly through randomization with
respect to an arbitrary prior. This difference is important for
practical purposes since it is usually impossible (or very difficult) to
perform computation on an explicitly discretized ε-net. Therefore
estimators based on ε-nets are often of theoretical interest only.
However, it is often feasible to draw samples from a posterior
randomization measure with respect to a continuous prior by using
standard Monte Carlo techniques. Therefore randomized estimation methods
are potentially useful for practical problems.

Since KL-entropy allows nonuniform priors, the concept can directly
characterize local adaptivity of randomized estimators when we put more
prior mass in certain regions of the model family. In contrast,
ε-entropy is a notion that tries to treat every part of the space
equally, which may not give the best possible results. For example, for
convergence of posterior distributions, the fact that entropy conditions
are not always the most appropriate was pointed out in [4], pages
522-523. The issue of adaptivity (and related nonuniform prior) cannot be
directly addressed with ε-entropy. In the literature, one has to employ
additional techniques such as peeling (e.g., see [13]) for this purpose.
As a comparison, the ability to use a nonuniform prior directly in our
analysis is conceptually useful. Putting a large prior mass in a certain
region indicates that we want to achieve a more accurate estimate in
that region, in exchange for slower convergence in a region with smaller
prior mass. The prior structure reflects our belief that the true
density is more likely to have a certain form than some alternative
forms. Therefore the theoretical analysis should also imply a more
accurate estimate when we are lucky enough to guess the true density $q$
correctly by putting a large prior mass around it. As we will see later,
finite-sample convergence bounds derived in this paper using KL-entropy
have this behavior.

Next we prove a simple information-theoretical inequality using the
KL-entropy of randomized estimators, which forms the basis of our
analysis. For a real-valued function $f(\theta)$ on $\Gamma$, we denote by
$E_\pi f(\theta)$ the expectation of $f(\cdot)$ with respect to $\pi$. Similarly,
for a real-valued function $\ell(x)$ on $\mathcal{X}$, we denote by $E_q \ell(x)$ the
expectation of $\ell(\cdot)$ with respect to the true underlying distribution
$q$. We also use $E_X$ to denote the expectation with respect to the
observation $X$ ($n$ independent samples from $q$).

The key ingredient of our analysis using KL-entropy is a well-known con- 
vex duality, which has already been used in some recent machine learning 
papers to study sample complexity bounds. For example, see [8, 11]. For 
completeness, we include a simple information-theoretical proof. 

Proposition 2.1. Assume that $f(\theta)$ is a measurable real-valued
function on $\Gamma$, and $w(\theta)$ is a density with respect to $\pi$; we have

$$E_\pi w(\theta) f(\theta) \le D_{KL}(w\,d\pi \| d\pi) + \ln E_\pi \exp(f(\theta)).$$

Proof. We assume that $E_\pi \exp(f(\theta)) < \infty$; otherwise the bound is
trivial. Consider $v(\theta) = \exp(f(\theta)) / E_\pi \exp(f(\theta))$. Since
$E_\pi v(\theta) = 1$, we can regard it as a density with respect to $\pi$. Using
this definition, it is easy to verify that the inequality in
Proposition 2.1 can be rewritten equivalently as

$$E_\pi w(\theta) \ln w(\theta) + \ln E_\pi \exp(f(\theta)) - E_\pi w(\theta) f(\theta) = D_{KL}(w\,d\pi \| v\,d\pi) \ge 0,$$

which is a well-known information-theoretical inequality, and follows
easily from Jensen's inequality. □
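As a sanity check, the convex duality of Proposition 2.1 can be tested on a small discrete prior. The sketch below is illustrative only: the three-point prior, the function values `f`, and the helper names are assumptions made for the example. It confirms that the gap between the two sides is nonnegative for random densities $w$ and vanishes at the density $v$ used in the proof.

```python
import math
import random

def duality_gap(w, f, prior):
    """Right side minus left side of Proposition 2.1 for a discrete prior."""
    lhs = sum(p * wi * fi for p, wi, fi in zip(prior, w, f))
    kl = sum(p * wi * math.log(wi) for p, wi in zip(prior, w) if wi > 0)
    log_mgf = math.log(sum(p * math.exp(fi) for p, fi in zip(prior, f)))
    return kl + log_mgf - lhs

def random_density(prior, rng):
    """Draw a random density with respect to `prior` (normalized so E_pi w = 1)."""
    raw = [rng.random() + 1e-3 for _ in prior]
    norm = sum(p * r for p, r in zip(prior, raw))
    return [r / norm for r in raw]

rng = random.Random(0)
prior = [0.5, 0.3, 0.2]   # illustrative prior
f = [1.0, -2.0, 0.5]      # illustrative function values f(theta)

# The gap is nonnegative for arbitrary densities w ...
gaps = [duality_gap(random_density(prior, rng), f, prior) for _ in range(1000)]

# ... and vanishes at v = exp(f) / E_pi exp(f), the density used in the proof.
z = sum(p * math.exp(fi) for p, fi in zip(prior, f))
v = [math.exp(fi) / z for fi in f]
gap_at_v = duality_gap(v, f, prior)
print(min(gaps), gap_at_v)
```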

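The main technical result which forms the basis of the paper is given by
the following lemma, where we assume that $\hat w_X(\theta)$ is a posterior
(represented as a density with respect to $\pi$ that depends on $X$ and is
measurable on $\mathcal{X}^n \times \Gamma$).

Lemma 2.1. Consider any posterior $\hat w_X(\theta)$. Let $\alpha$ and $\beta$ be two
real numbers. The following inequality holds for all measurable
real-valued functions $L_X(\theta)$ on $\mathcal{X}^n \times \Gamma$:

$$E_X \exp\left[ E_\pi \hat w_X(\theta) \left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right) - D_{KL}(\hat w_X\,d\pi \| d\pi) \right] \le E_\pi \frac{E_X e^{L_X(\theta)}}{\left( E_X e^{\beta L_X(\theta)} \right)^{\alpha}},$$

where $E_X$ is the expectation with respect to the observation $X$.

Proof. From Proposition 2.1, we obtain

$$\hat L(X) = E_\pi \hat w_X(\theta) \left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right) - D_{KL}(\hat w_X\,d\pi \| d\pi) \le \ln E_\pi \exp\left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right).$$

Now applying Fubini's theorem to interchange the order of integration,
we have

$$E_X e^{\hat L(X)} \le E_X E_\pi e^{L_X(\theta) - \alpha \ln E_X \exp(\beta L_X(\theta))} = E_\pi \frac{E_X e^{L_X(\theta)}}{\left( E_X e^{\beta L_X(\theta)} \right)^{\alpha}}.$$ □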

Remark 2.1. The importance of the above inequality is that the
left-hand side is a quantity that involves an arbitrary posterior
randomization measure $\hat w_X\,d\pi$. The right-hand side is a numerical
constant independent of the estimator $\hat w_X$. Therefore the inequality
gives a bound that can be applied to an arbitrary randomized estimator.
The remaining issue is merely how to interpret the resulting bound,
which we shall focus on later in this paper.

Remark 2.2. The main technical ingredients of the proof are moti- 
vated from techniques in the recent machine learning literature. The general 
idea for analyzing randomized estimators using Fubini's theorem and decou- 
pling was already in [17]. The specific decoupling mechanism using Propo- 
sition 2.1 appeared in [3]; see [8, 11] for related problems. A simplified form 
of Lemma 2.1 was used in [18] to analyze Bayesian posterior distributions. 

The following bound is a straightforward consequence of Lemma 2.1. Note
that for density estimation, the loss $\ell_\theta(x)$ has the form
$\ell(p(x|\theta))$, where $\ell(\cdot)$ is a scaled log-loss.

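Theorem 2.1. We use the notation of Lemma 2.1. Let $X = \{X_1, \ldots, X_n\}$
be $n$ samples that are independently drawn from $q$. Consider a measurable
function $\ell_\theta(x) : \Gamma \times \mathcal{X} \to R$, and real numbers $\alpha$ and $\beta$,
and define

$$c_n(\alpha, \beta) = \frac{1}{n} \ln E_\pi \left( \frac{E_q e^{-\ell_\theta(x)}}{\left( E_q e^{-\beta \ell_\theta(x)} \right)^{\alpha}} \right)^{n}.$$

Then for all $t$, the following event holds with probability at least
$1 - \exp(-t)$:

$$-\alpha E_\pi \hat w_X(\theta) \ln E_q e^{-\beta \ell_\theta(x)} \le \frac{E_\pi \hat w_X(\theta) \sum_{i=1}^n \ell_\theta(X_i) + D_{KL}(\hat w_X\,d\pi \| d\pi) + t}{n} + c_n(\alpha, \beta).$$

Moreover, we have the expected risk bound

$$-\alpha E_X E_\pi \hat w_X(\theta) \ln E_q e^{-\beta \ell_\theta(x)} \le E_X \frac{E_\pi \hat w_X(\theta) \sum_{i=1}^n \ell_\theta(X_i) + D_{KL}(\hat w_X\,d\pi \| d\pi)}{n} + c_n(\alpha, \beta).$$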

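Proof. We use the notation of Lemma 2.1, with
$L_X(\theta) = -\sum_{i=1}^n \ell_\theta(X_i)$. If we define

$$\hat L(X) = E_\pi \hat w_X(\theta) \left( L_X(\theta) - \alpha \ln E_X e^{\beta L_X(\theta)} \right) - D_{KL}(\hat w_X\,d\pi \| d\pi)$$
$$\phantom{\hat L(X)} = E_\pi \hat w_X(\theta) \left( -\sum_{i=1}^n \ell_\theta(X_i) - n\alpha \ln E_q e^{-\beta \ell_\theta(x)} \right) - D_{KL}(\hat w_X\,d\pi \| d\pi),$$

then by Lemma 2.1 we have $E_X e^{\hat L(X)} \le e^{n c_n(\alpha,\beta)}$. By Markov's
inequality this implies, for all $\epsilon$,
$e^{\epsilon} P(\hat L(X) > \epsilon) \le e^{n c_n(\alpha,\beta)}$. Now given any $t$, and letting
$\epsilon = t + n c_n(\alpha, \beta)$, we obtain

$$e^{t + n c_n(\alpha,\beta)} P(\hat L(X) > t + n c_n(\alpha, \beta)) \le e^{n c_n(\alpha,\beta)}.$$

That is, with probability at least $1 - e^{-t}$,
$\hat L(X) \le \epsilon = n c_n(\alpha, \beta) + t$. By rearranging the inequality, we
establish the first inequality of the theorem.

To prove the second inequality, we still start with
$E_X e^{\hat L(X)} \le e^{n c_n(\alpha,\beta)}$ from Lemma 2.1. From Jensen's inequality
with the convex function $e^x$, we obtain
$e^{E_X \hat L(X)} \le E_X e^{\hat L(X)} \le e^{n c_n(\alpha,\beta)}$. That is,
$E_X \hat L(X) \le n c_n(\alpha, \beta)$. By rearranging the inequality, we obtain the
desired bound. □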

Remark 2.3. The special case of Theorem 2.1 with $\alpha = \beta = 1$ is very
useful since in this case the term $c_n(\alpha, \beta)$ vanishes. In fact, in
order to obtain the correct rate of convergence for nonparametric
problems, it is sufficient to choose $\alpha = \beta = 1$. The more complicated
case with general $\alpha$ and $\beta$ is only needed for parametric problems,
where we would like to obtain a convergence rate of the order $O(1/n)$.
In such cases the choice of $\alpha = \beta = 1$ would lead to a rate of
$O(\ln n / n)$, which is suboptimal.

3. Information complexity minimization. Let $S$ be a predefined set of
densities on $\Gamma$ with respect to the prior $\pi$. We consider a general
information complexity minimization estimator,

(1)    $$\hat w_X^S = \arg\min_{w \in S} \left[ -E_\pi w(\theta) \sum_{i=1}^n \ln p(X_i|\theta) + \lambda D_{KL}(w\,d\pi \| d\pi) \right].$$

Given the true density $q$, if we define

(2)    $$\hat R_\lambda(w) = \frac{1}{n} E_\pi w(\theta) \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)} + \frac{\lambda}{n} D_{KL}(w\,d\pi \| d\pi),$$

then it is clear that

$$\hat w_X^S = \arg\min_{w \in S} \hat R_\lambda(w).$$

The above estimation procedure finds a randomized estimator by
minimizing the regularized empirical risk $\hat R_\lambda(w)$ among all possible
densities with respect to the prior $\pi$ in a predefined set $S$. The
purpose of this section is to study the performance of this estimator
using Theorem 2.1. For simplicity, we shall only study the expected
performance using the second inequality, although similar results can be
obtained using the first inequality (which leads to exponential
probability bounds).

One may define the true risk of $w$ by replacing the empirical expectation
in (2) with the true expectation with respect to $q$:

(3)    $$R_\lambda(w) = E_\pi w(\theta) D_{KL}(q \| p(\cdot|\theta)) + \frac{\lambda}{n} D_{KL}(w\,d\pi \| d\pi),$$

where $D_{KL}(q \| p) = E_q \ln(q(x)/p(x))$ is the KL-divergence between $q$ and
$p$. The information complexity minimizer in (1) can be regarded as an
approximate solution to (3) using empirical expectation.
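To make (1) concrete, consider the special case where $S$ contains all densities with respect to $\pi$ (an assumption made for this illustration; the paper allows a general $S$). Proposition 2.1 applied to $f(\theta) = \lambda^{-1} \sum_i \ln p(X_i|\theta)$ then shows that the minimizer is the Gibbs-type density $w(\theta) \propto \prod_i p(X_i|\theta)^{1/\lambda}$, with minimum value $-\lambda \ln E_\pi \prod_i p(X_i|\theta)^{1/\lambda}$. The sketch below checks this on a finite grid of Bernoulli models; the grid, the simulated data, and all function names are hypothetical.

```python
import math
import random

def log_likelihoods(data, thetas):
    """Sum_i ln p(X_i | theta) for Bernoulli(theta) models on a finite grid."""
    ones = sum(data)
    zeros = len(data) - ones
    return [ones * math.log(t) + zeros * math.log(1.0 - t) for t in thetas]

def icm_objective(w, loglik, prior, lam):
    """Objective of (1): -E_pi w * loglik + lam * D_KL(w dpi || dpi)."""
    fit = -sum(p * wi * ll for p, wi, ll in zip(prior, w, loglik))
    kl = sum(p * wi * math.log(wi) for p, wi in zip(prior, w) if wi > 0)
    return fit + lam * kl

def gibbs_posterior(loglik, prior, lam):
    """Density w proportional to exp(loglik / lam), normalized so E_pi w = 1."""
    shift = max(loglik)  # subtract the maximum for numerical stability
    raw = [math.exp((ll - shift) / lam) for ll in loglik]
    norm = sum(p * r for p, r in zip(prior, raw))
    return [r / norm for r in raw]

rng = random.Random(1)
data = [1 if rng.random() < 0.7 else 0 for _ in range(50)]  # simulated sample
thetas = [0.1 * k for k in range(1, 10)]                    # illustrative grid
prior = [1.0 / len(thetas)] * len(thetas)
lam = 2.0

loglik = log_likelihoods(data, thetas)
w_star = gibbs_posterior(loglik, prior, lam)
best = icm_objective(w_star, loglik, prior, lam)

# Compare against random competitor densities with respect to the prior.
competitors = []
for _ in range(500):
    raw = [rng.random() + 1e-6 for _ in thetas]
    norm = sum(p * r for p, r in zip(prior, raw))
    competitors.append(icm_objective([r / norm for r in raw], loglik, prior, lam))

closed_form = -lam * math.log(sum(p * math.exp(ll / lam) for p, ll in zip(prior, loglik)))
print(best, closed_form, min(competitors))
```

With $\lambda = 1$ this density is the usual Bayesian posterior; restricting $S$ to point masses gives the MDL estimator discussed in Section 4.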

Using empirical process techniques, one can typically expect to bound
$R_\lambda(w)$ in terms of $\hat R_\lambda(w)$. Unfortunately, it does not work in our
case since $D_{KL}(q \| p)$ is not well defined for all $p$. This implies that
as long as $w$ has nonzero concentration around a density $p$ with
$D_{KL}(q \| p) = +\infty$, then $R_\lambda(w) = +\infty$. Therefore we may have
$R_\lambda(\hat w_X^S) = +\infty$ with nonzero probability even when the sample size
approaches infinity.

A remedy is to use a distance function that is always well defined. In
statistics, one often considers the ρ-divergence for $\rho \in (0, 1)$, which
is defined as

(4)    $$D_\rho(q \| p) = \frac{1}{\rho(1-\rho)} E_q \left[ 1 - \left( \frac{p(x)}{q(x)} \right)^{\rho} \right].$$

This divergence is always well defined and
$D_{KL}(q \| p) = \lim_{\rho \to 0} D_\rho(q \| p)$. In the statistical literature,
convergence results were often specified under the squared Hellinger
distance ($\rho = 0.5$). In this paper we specify convergence results with
general $\rho$. We shall mention that bounds derived in this paper will
become trivial when $\rho \to 0$. This is consistent with the above
discussion since $R_\lambda$ (corresponding to $\rho = 0$) may not converge at all.
However, under additional assumptions, such as the boundedness of $q/p$,
$D_{KL}(q \| p)$ exists and can be bounded using the ρ-divergence
$D_\rho(q \| p)$.

A concept related to the ρ-divergence in (4) is the Rényi entropy
introduced in [9]. The notion has been widely used in information
theory. Up to a scaling factor, it can be defined as

$$D^{Re}_\rho(q \| p) = -\frac{1}{\rho(1-\rho)} \ln E_q \left( \frac{p(x)}{q(x)} \right)^{\rho}.$$

Note that the standard definition of Rényi entropy in the literature is
$\rho D^{Re}_\rho(q \| p)$. We employ a scaled version in this paper for
compatibility with our ρ-divergence definition. Using the inequality
$1 - x \le -\ln x \le x^{-1} - 1$ ($x \in [0, 1]$), we can see that for all $p$
and $q$,

$$D_\rho(q \| p) \le D^{Re}_\rho(q \| p) \le \frac{D_\rho(q \| p)}{1 - \rho(1-\rho) D_\rho(q \| p)}.$$

The following bounds imply that up to a constant, the ρ-divergence with
any $\rho \in (0, 1)$ is equivalent to the squared Hellinger distance.
Therefore a convergence bound in any ρ-divergence implies a convergence
bound of the same rate in the Hellinger distance.
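For discrete distributions the quantities above are easy to compute. The following sketch (the distributions `q`, `p`, `p_bad` and the function names are illustrative assumptions) shows three properties used in the text: $D_{1/2}$ equals twice the sum of squared differences of square roots, $D_\rho$ approaches $D_{KL}$ as $\rho \to 0$, and $D_\rho$ stays finite even when $D_{KL}$ is infinite. It also checks $D_\rho \le D^{Re}_\rho$.

```python
import math

def rho_divergence(q, p, rho):
    """D_rho(q || p) = E_q[1 - (p/q)^rho] / (rho * (1 - rho)) for discrete q, p."""
    expectation = sum(qi * (1.0 - (pi / qi) ** rho) for qi, pi in zip(q, p) if qi > 0)
    return expectation / (rho * (1.0 - rho))

def renyi_divergence(q, p, rho):
    """Scaled Renyi entropy: -ln E_q (p/q)^rho / (rho * (1 - rho))."""
    moment = sum(qi * (pi / qi) ** rho for qi, pi in zip(q, p) if qi > 0)
    return -math.log(moment) / (rho * (1.0 - rho))

def kl_divergence(q, p):
    """D_KL(q || p) = E_q ln(q/p); infinite if p vanishes where q does not."""
    if any(qi > 0 and pi == 0 for qi, pi in zip(q, p)):
        return math.inf
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]
p = [0.2, 0.5, 0.3]
p_bad = [0.0, 0.6, 0.4]   # puts zero mass on a point that q charges

sqrt_diff_sum = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(q, p))
print(rho_divergence(q, p, 0.5), 2.0 * sqrt_diff_sum)
print(rho_divergence(q, p, 1e-6), kl_divergence(q, p))
print(rho_divergence(q, p_bad, 0.5), kl_divergence(q, p_bad))
```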

Proposition 3.1. We have the following inequalities for all $\rho \in [0, 1]$:

$$\max(\rho, 1-\rho) D_\rho(q \| p) \ge \tfrac{1}{2} D_{1/2}(q \| p) \ge \min(\rho, 1-\rho) D_\rho(q \| p).$$

Proof. We prove the first half of the two inequalities. Due to the
symmetry $D_\rho(q \| p) = D_{1-\rho}(p \| q)$, we only need to consider the case
$\rho \le 1/2$. The proof of the second half (with $\rho \ge 1/2$) is identical
except that the sign in the Taylor expansion step is reversed.

We use Taylor expansion. Let $x = (p^{1/2} - q^{1/2}) / q^{1/2}$; then
$x \ge -1$, and there exists $\xi > -1$ such that

$$(1 + x)^{2\rho} = 1 + 2\rho x + \rho(2\rho - 1)(1 + \xi)^{2\rho - 2} x^2 \le 1 + 2\rho x.$$

Now taking expectation with respect to $q$, we obtain

$$E_q \left( \frac{p}{q} \right)^{\rho} = E_q \left( 1 + \frac{p^{1/2} - q^{1/2}}{q^{1/2}} \right)^{2\rho} \le 1 + 2\rho E_q \frac{p^{1/2} - q^{1/2}}{q^{1/2}}.$$

By rearranging the inequality, we obtain
$2\rho \left( \tfrac{1}{4} D_{1/2}(q \| p) \right) \le \rho(1-\rho) D_\rho(q \| p)$. □
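The two-sided comparison in Proposition 3.1 can be spot-checked by simulation. The following sketch draws random pairs of discrete distributions and random values of $\rho$; the sampling scheme, the tolerance, and the helper names are assumptions made for the check, which illustrates the statement rather than proving it.

```python
import random

def rho_divergence(q, p, rho):
    """D_rho(q || p) = E_q[1 - (p/q)^rho] / (rho * (1 - rho)) for discrete q, p."""
    expectation = sum(qi * (1.0 - (pi / qi) ** rho) for qi, pi in zip(q, p) if qi > 0)
    return expectation / (rho * (1.0 - rho))

def random_distribution(k, rng):
    """A random probability vector of length k with strictly positive entries."""
    raw = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(2)
violations = 0
for _ in range(2000):
    q = random_distribution(5, rng)
    p = random_distribution(5, rng)
    rho = rng.uniform(0.01, 0.99)
    d_rho = rho_divergence(q, p, rho)
    half = 0.5 * rho_divergence(q, p, 0.5)
    upper = max(rho, 1.0 - rho) * d_rho
    lower = min(rho, 1.0 - rho) * d_rho
    # Small tolerance to absorb floating-point rounding.
    if not (lower - 1e-12 <= half <= upper + 1e-12):
        violations += 1
print(violations)
```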



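3.1. A general convergence bound. The following theorem is a
consequence of Theorem 2.1. Most of our later discussion can be
considered as interpretation of this theorem under different conditions.

Theorem 3.1. Consider the estimator $\hat w_X^S$ defined in (1). Let $\alpha > 0$.
Then for all $\rho \in (0, 1)$ and $\gamma \ge \rho$ such that
$\lambda' = \frac{\lambda\gamma - 1}{\gamma - \rho} \ge 0$, we have

$$E_X E_\pi \hat w_X^S(\theta) D_\rho(q \| p(\cdot|\theta)) \le E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta))$$
$$\le \frac{\gamma \inf_{w \in S} R_\lambda(w)}{\alpha\rho(1-\rho)} - \frac{\gamma - \rho}{\alpha\rho(1-\rho)} E_X \hat R_{\lambda'}(\hat w_X^S) + \frac{c_{\rho,n}(\alpha)}{\alpha\rho(1-\rho)},$$

where

$$c_{\rho,n}(\alpha) = \frac{1}{n} \ln E_\pi \left( E_q \left( \frac{p(x|\theta)}{q(x)} \right)^{\rho} \right)^{(1-\alpha)n} = \frac{1}{n} \ln E_\pi e^{-\rho(1-\rho)(1-\alpha) n D^{Re}_\rho(q \| p(\cdot|\theta))}.$$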



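Proof. Consider an arbitrary data-independent density $w(\theta) \in S$ with
respect to $\pi$. Using (4), we can obtain from Theorem 2.1 the chain of
relations

$$\alpha\rho(1-\rho) E_X E_\pi \hat w_X^S(\theta) D_\rho(q \| p(\cdot|\theta))$$
$$\le \alpha\rho(1-\rho) E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta))$$
$$= -\alpha E_X E_\pi \hat w_X^S(\theta) \ln E_q \exp\left( -\rho \ln \frac{q(x)}{p(x|\theta)} \right)$$
$$\le E_X \left[ \rho E_\pi \hat w_X^S(\theta) \frac{1}{n} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)} + \frac{D_{KL}(\hat w_X^S\,d\pi \| d\pi)}{n} \right] + c_{\rho,n}(\alpha)$$
$$= E_X \left[ \gamma \hat R_\lambda(\hat w_X^S) + (\rho - \gamma) \hat R_{\lambda'}(\hat w_X^S) \right] + c_{\rho,n}(\alpha)$$
$$\le E_X \left[ \gamma \hat R_\lambda(w) + (\rho - \gamma) \hat R_{\lambda'}(\hat w_X^S) \right] + c_{\rho,n}(\alpha)$$
$$= \gamma R_\lambda(w) - (\gamma - \rho) E_X \hat R_{\lambda'}(\hat w_X^S) + c_{\rho,n}(\alpha),$$

where $R_\lambda(w)$ is defined in (3). Note that the first inequality uses the
fact $-\ln(1 - x) \ge x$. The second inequality follows from Theorem 2.1
with the choice $\ell_\theta(x) = \rho \ln \frac{q(x)}{p(x|\theta)}$ and $\beta = 1$. The third
inequality follows from the definition of $\hat w_X^S$ in (1). □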

Remark 3.1. If $\gamma = \rho$ in Theorem 3.1, then we also require $\lambda\gamma = 1$,
and let $\lambda' = 0$.
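The equality step in the proof of Theorem 3.1, which rewrites the empirical bound in terms of $\hat R_\lambda$ and $\hat R_{\lambda'}$, rests on a short piece of algebra that may be worth spelling out. The display below is a worked check under the definition of $\hat R_\lambda$ in (2) and $\lambda' = (\lambda\gamma - 1)/(\gamma - \rho)$; the shorthands $\hat T(w)$ and $K(w)$ are introduced here only for illustration and do not appear in the paper.

```latex
% Illustrative shorthand (not from the original text):
%   \hat T(w) = \frac{1}{n} E_\pi w(\theta) \sum_{i=1}^n \ln\frac{q(X_i)}{p(X_i|\theta)},
%   K(w)      = D_{KL}(w\,d\pi \| d\pi),
% so that \hat R_\lambda(w) = \hat T(w) + \frac{\lambda}{n} K(w).
\begin{align*}
\gamma \hat R_\lambda(w) + (\rho - \gamma)\hat R_{\lambda'}(w)
  &= (\gamma + \rho - \gamma)\,\hat T(w)
     + \frac{\gamma\lambda + (\rho - \gamma)\lambda'}{n}\,K(w) \\
  &= \rho\,\hat T(w) + \frac{\gamma\lambda - (\lambda\gamma - 1)}{n}\,K(w) \\
  &= \rho\,\hat T(w) + \frac{1}{n}\,K(w).
\end{align*}
```

The last line is exactly the bracketed quantity produced by Theorem 2.1 with $\ell_\theta(x) = \rho \ln (q(x)/p(x|\theta))$, which explains the particular form of $\lambda'$.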



Although the bound in Theorem 3.1 looks complicated, the most important
part on the right-hand side is the first term. The second term is only
needed to handle the situation $\lambda \le 1$. The requirement that $\gamma \ge \rho$
is to ensure that the second term is nonpositive. Therefore in order to
apply the theorem, we only need to estimate a lower bound of
$\hat R_{\lambda'}(\hat w_X^S)$, which (as we shall see later) is much easier than
obtaining an upper bound. The third term is mainly included to get the
correct convergence rate of $O(1/n)$ for parametric problems, and can be
ignored for nonparametric problems. The effect of this term is quite
similar to using localized ε-entropy in the empirical process approach
for analyzing the maximum-likelihood method; for example, see [13]. As a
comparison, the KL-entropy in the first term corresponds to the global
ε-entropy.

Note that one can easily obtain a simplified bound from Theorem 3.1 by
choosing specific parameters so that both the second term and the third
term vanish:

Corollary 3.1. Consider the estimator $\hat w_X^S$ defined in (1). Assume
that $\lambda > 1$ and let $\rho = 1/\lambda$. We have

$$E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta)) \le \frac{1}{1-\rho} \inf_{w \in S} R_\lambda(w).$$

Proof. We simply let $\alpha = 1$ and $\gamma = \rho$ in Theorem 3.1. □

An important observation is that for $\lambda > 1$, the convergence rate is
solely determined by the quantity $\inf_{w \in S} R_\lambda(w)$, which we shall refer
to as the model resolvability associated with $S$.

3.2. Some consequences of Theorem 3.1. In order to apply Theorem 3.1,
we need to bound the quantity $E_X \hat R_{\lambda'}(\hat w_X^S)$ from below. Some of
these results can be found in the Appendix, and by using these results,
we are able to obtain some refined bounds from Theorem 3.1.

Corollary 3.2. Consider the estimator $\hat w_X^S$ defined in (1). Assume
that $\lambda > 1$; then for all $\rho \in (0, 1/\lambda]$,

$$E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta)) \le \frac{1}{\rho(\lambda - 1)} \inf_{w \in S} R_\lambda(w).$$

Proof. We simply let $\alpha = 1$ and $\gamma = (1 - \rho)/(\lambda - 1)$ in Theorem 3.1.
Note that in this case, $\lambda' = 1$, and hence by Lemma A.1 in the Appendix,
we have $E_X \hat R_{\lambda'}(\hat w_X^S) \ge 0$. □
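The parameter choice in the proof of Corollary 3.2 can be verified directly. The following worked computation is a sketch based only on the definitions of $\lambda'$ and $c_{\rho,n}(\alpha)$ in Theorem 3.1. It covers $\rho < 1/\lambda$; at the endpoint $\rho = 1/\lambda$ one has $\gamma = \rho$ and $\lambda\gamma = 1$, so Remark 3.1 applies and the second term vanishes.

```latex
% With \alpha = 1 and \gamma = (1-\rho)/(\lambda-1), for \rho < 1/\lambda:
\begin{align*}
\lambda\gamma - 1 &= \frac{\lambda(1-\rho) - (\lambda - 1)}{\lambda - 1}
                   = \frac{1 - \lambda\rho}{\lambda - 1}, \\
\gamma - \rho     &= \frac{(1-\rho) - \rho(\lambda - 1)}{\lambda - 1}
                   = \frac{1 - \lambda\rho}{\lambda - 1}, \\
\lambda'          &= \frac{\lambda\gamma - 1}{\gamma - \rho} = 1, \\
\frac{\gamma}{\alpha\rho(1-\rho)} &= \frac{1}{\rho(\lambda - 1)},
\qquad
c_{\rho,n}(1) = \frac{1}{n}\ln E_\pi 1 = 0 .
\end{align*}
```

Since $\gamma - \rho \ge 0$ and $E_X \hat R_{1}(\hat w_X^S) \ge 0$, the second term of Theorem 3.1 can be dropped, which leaves the stated bound.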

Note that Lemma A.1 is only applicable for $\lambda' \ge 1$. If $\lambda' < 1$, then we
need a discretization device which generalizes the upper ε-covering
number concept used in [2] for showing the consistency (or
inconsistency) of Bayesian posterior distributions:

Definition 3.1. The ε-upper bracketing number of $\Gamma$, denoted by
$N_{ub}(\Gamma, \epsilon)$, is the minimum number of nonnegative functions $\{f_j\}$ on
$\mathcal{X}$ with respect to $\mu$ such that $E_q(f_j/q) \le 1 + \epsilon$, and for all
$\theta \in \Gamma$ there exists $j$ such that $p(x|\theta) \le f_j(x)$ a.e. $[\mu]$.

The discretization device which we shall use in this paper is based on the
following definition.

Definition 3.2. Given a set $\Gamma' \subset \Gamma$, we define its upper-bracketing
radius as

$$r_{ub}(\Gamma') = \int \sup_{\theta \in \Gamma'} p(x|\theta)\, d\mu(x) - 1.$$

An ε-upper discretization of $\Gamma$ consists of a covering of $\Gamma$ by
countably many measurable subsets $\{\Gamma_j\}$ such that $\bigcup_j \Gamma_j = \Gamma$ and
$r_{ub}(\Gamma_j) \le \epsilon$.

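Using this concept, we may combine the estimate in Lemma A.2 in the
Appendix with Theorem 3.1, and obtain the following simplified bound for
$\lambda = 1$. Similar results can also be obtained for $\lambda < 1$.

Corollary 3.3. Consider the estimator $\hat w_X^S$ defined in (1). Let
$\lambda = 1$. Consider an arbitrary covering $\{\Gamma_j\}$ of $\Gamma$. For all
$\rho \in (0, 1)$ and all $\gamma \ge 1$, we have

$$E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta)) \le \frac{\gamma \inf_{w \in S} R_\lambda(w)}{\rho(1-\rho)} + \frac{\gamma - \rho}{\rho(1-\rho) n} \ln \sum_j \pi(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)} \left( 1 + r_{ub}(\Gamma_j) \right)^{n}.$$

In particular, if $\{\Gamma_j\}$ is an ε-upper discretization of $\Gamma$, then

$$E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta)) \le \frac{\gamma \inf_{w \in S} R_\lambda(w)}{\rho(1-\rho)} + \frac{\gamma - \rho}{\rho(1-\rho)} \left[ \frac{\ln \sum_j \pi(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)}}{n} + \ln(1 + \epsilon) \right].$$

Proof. We let $\alpha = 1$ in Theorem 3.1 and apply Lemma A.2. □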

Note that the above results immediately imply the following bound using
ε-upper entropy by letting $\gamma \to 1$ with a finite ε-upper bracketing
cover of size $N_{ub}(\Gamma, \epsilon)$ as the discretization:

$$E_X E_\pi \hat w_X^S(\theta) D^{Re}_\rho(q \| p(\cdot|\theta)) \le \frac{\inf_{w \in S} R_\lambda(w)}{\rho(1-\rho)} + \frac{1}{\rho} \inf_{\epsilon > 0} \left[ \frac{\ln N_{ub}(\Gamma, \epsilon)}{n} + \ln(1 + \epsilon) \right].$$

It is clear that Corollary 3.3 is significantly more general. We are
able to deal with an infinite cover as long as the decay of the prior $\pi$
is fast enough on an ε-upper discretization so that
$\sum_j \pi(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)} < +\infty$.

3.3. Weak convergence bound. The case of $\lambda = 1$ is related to a number
of important estimation methods in statistical applications. However,
for an arbitrary prior $\pi$ without any additional assumption such as the
fast decay condition in Corollary 3.3, it is impossible to establish any
convergence rate result in terms of Hellinger distance using the model
resolvability quantity alone, as in the case of $\lambda > 1$ (Corollary 3.2).
See Section 4.4 for an example demonstrating this claim. However, one
can still obtain a weaker convergence result in this case.

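Theorem 3.2. Consider the estimator $\hat w_X^S$ defined in (1) with $\lambda = 1$.
Then for all $f : \mathcal{X} \to [-1, 1]$, we have

$$E_X \left| E_\pi \hat w_X^S(\theta) E_{p(\cdot|\theta)} f(x) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right| \le 2A_n + \sqrt{2A_n},$$

where $E_{p(\cdot|\theta)} f(x) = \int f(x) p(x|\theta)\, d\mu(x)$ is the expectation with
respect to $p(\cdot|\theta)$ on $\mathcal{X}$, and
$A_n = \inf_{w \in S} E_X \hat R_\lambda(w) + \frac{\ln 2}{n}$.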

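Proof. The first half of the proof, leading to (5), is an application of
Theorem 2.1. The second half is very similar to the proof of Theorem 3.1.

Let $g_\epsilon(x) = 1 - \epsilon f(x)$, and
$h_\epsilon(\theta, x) = \frac{q(x)}{p(x|\theta) g_\epsilon(x)}$, where $\epsilon \in (-1, 1)$ is a
parameter to be determined later. Note that $g_\epsilon(x) > 0$.

We consider an extension of $\Gamma$ to $\Gamma' = \Gamma \times \{\pm 1\}$. Let $\sigma = \pm 1$, and
$\theta' = (\theta, \sigma) \in \Gamma'$. We define a prior $\pi'$ on $\Gamma'$ such that
$\pi'((\theta, \sigma)) = 0.5\,\pi(\theta)$. For a posterior $\hat w_X(\theta)$ on $\Gamma$, we
consider for $u = \pm 1$ a posterior $\hat w_X^{\prime u}(\theta, \sigma)$ on $\Gamma'$ such that
$\hat w_X^{\prime u}(\theta, \sigma) = 2 \hat w_X(\theta)$ when $\sigma = u$, and
$\hat w_X^{\prime u}(\theta, \sigma) = 0$ otherwise; its KL-entropy with respect to
$\pi'$ equals $D_{KL}(\hat w_X\,d\pi \| d\pi) + \ln 2$. Let $\alpha = \beta = 1$ and
$\ell_{\theta,\sigma}(x) = \ln h_{\sigma\epsilon}(\theta, x)$. For all $u(X) \in \{\pm 1\}$, we apply
Theorem 2.1 to the posterior $\hat w_X^{\prime u(X)}$ constructed from
$\hat w_X^S$, and obtain

$$-E_X E_\pi \hat w_X^S(\theta) \ln E_q e^{-\ell_{\theta,u(X)}(x)} \le E_X \frac{E_\pi \hat w_X^S(\theta) \sum_{i=1}^n \ln h_{u(X)\epsilon}(\theta, X_i) + D_{KL}(\hat w_X^S\,d\pi \| d\pi) + \ln 2}{n}.$$

Note that $E_q e^{-\ell_{\theta,\sigma}(x)} = E_{p(\cdot|\theta)} g_{\sigma\epsilon}(x)$. Therefore if we let

$$\Delta_\epsilon(X) = E_\pi \hat w_X^S(\theta) \left( \sum_{i=1}^n \ln g_\epsilon(X_i) - n \ln E_{p(\cdot|\theta)} g_\epsilon(x) \right),$$

then

(5)    $$E_X \Delta_{u(X)\epsilon}(X) \le n E_X \hat R_\lambda(\hat w_X^S) + \ln 2 \le n \inf_{w \in S} E_X \hat R_\lambda(w) + \ln 2,$$

where the second inequality follows from the definition of $\hat w_X^S$ in
(1). This inequality plays the same role as Theorem 2.1 in the proof of
Theorem 3.1.

Consider $x \le y < 1$. We have the inequalities (which follow from Taylor
expansion)

$$x \le -\ln(1 - x) \le x + \frac{x^2}{2(1 - y)^2}.$$

This implies $\ln g_\epsilon(x) \ge -\epsilon f(x) - \frac{\epsilon^2}{2(1 - |\epsilon|)^2}$ and
$-\ln E_{p(\cdot|\theta)} g_\epsilon(x) \ge \epsilon E_{p(\cdot|\theta)} f(x)$. Therefore

$$\Delta_\epsilon(X) \ge \epsilon E_\pi \hat w_X^S(\theta) \left( -\sum_{i=1}^n f(X_i) + n E_{p(\cdot|\theta)} f(x) \right) - \frac{n \epsilon^2}{2(1 - |\epsilon|)^2}.$$

Substitute into (5); we have

$$E_X \sup_{u \in \{\pm 1\}} \left[ u \epsilon E_\pi \hat w_X^S(\theta) \left( -\sum_{i=1}^n f(X_i) + n E_{p(\cdot|\theta)} f(x) \right) - \frac{n \epsilon^2}{2(1 - |\epsilon|)^2} \right] \le n \inf_{w \in S} E_X \hat R_\lambda(w) + \ln 2.$$

Therefore we have

$$E_X \left| E_\pi \hat w_X^S(\theta) \left( -\sum_{i=1}^n f(X_i) + n E_{p(\cdot|\theta)} f(x) \right) \right| \le \frac{n |\epsilon|}{2(1 - |\epsilon|)^2} + \frac{n A_n}{|\epsilon|}.$$

Let $|\epsilon| = \sqrt{2A_n} / (\sqrt{2A_n} + 1)$ and we obtain the desired bound. □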
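The final substitution in the proof of Theorem 3.2 can be confirmed numerically. Writing $s = \sqrt{2A_n}$ and $|\epsilon| = s/(s+1)$, both terms of the per-sample bound equal $s(s+1)/2$, so their sum is $s^2 + s = 2A_n + \sqrt{2A_n}$. The sketch below checks this for a few values of $A_n$; the function names and the chosen values are illustrative.

```python
import math

def weak_bound_rhs(eps, a_n):
    """Per-sample bound |eps| / (2 (1 - |eps|)^2) + A_n / |eps| from the proof."""
    e = abs(eps)
    return e / (2.0 * (1.0 - e) ** 2) + a_n / e

def chosen_eps(a_n):
    """The choice |eps| = sqrt(2 A_n) / (sqrt(2 A_n) + 1) used in the proof."""
    s = math.sqrt(2.0 * a_n)
    return s / (s + 1.0)

checks = []
for a_n in [1e-4, 1e-2, 0.3, 2.0]:
    value = weak_bound_rhs(chosen_eps(a_n), a_n)
    target = 2.0 * a_n + math.sqrt(2.0 * a_n)
    checks.append((a_n, value, target))

for a_n, value, target in checks:
    print(a_n, value, target)
```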

Note that for all $f$ taking values in $[-1, 1]$ the empirical average
$\frac{1}{n} \sum_{i=1}^n f(X_i)$ converges to $E_q f(x)$,

$$E_X \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - E_q f(x) \right| \le n^{-1/2}.$$

It follows from Theorem 3.2 that

$$E_X \left| E_\pi \hat w_X^S(\theta) E_{p(\cdot|\theta)} f(x) - E_q f(x) \right| \le 2A_n + \sqrt{2A_n} + n^{-1/2}.$$

This means that as long as $\lim_n A_n = 0$, for all bounded functions
$f(x) \in [-1, 1]$, the posterior average $E_\pi \hat w_X^S(\theta) E_{p(\cdot|\theta)} f(x)$
converges to $E_q f(x)$ in probability. Since Theorem 3.2 uses the same
weak topology as that in the usual definition of weak convergence of
measures, we can interpret this result to mean the posterior average
$E_\pi \hat w_X^S(\theta) p(\cdot|\theta)$ converges weakly to $q$ in probability. In
particular, by letting $f(x)$ be an indicator function for an arbitrary
set $B \subset \mathcal{X}$, we obtain the consistency of the probability estimate.
That is, the probability of $B$ under the posterior mean
$E_\pi \hat w_X^S(\theta) p(\cdot|\theta)$ converges to the probability of $B$ under $q$
(when $\lim_n A_n = 0$).






4. Two-part code MDL on discrete net. The minimum description length
(MDL) method has been widely used in practice [10]. The two-part code
MDL we consider here is the same as that of Barron and Cover [1]. In
fact, results in this section improve those of Barron and Cover [1]. The
MDL method considered in [1] can be regarded as a special case of
information complexity minimization. The model space $\Gamma$ is countable:
$\theta \in \Gamma = \{1, 2, \ldots\}$. We denote the corresponding models
$p(x|\theta = j)$ by $p_j(x)$. The prior $\pi$ has the form
$\pi = \{\pi_1, \pi_2, \ldots\}$ such that $\sum_j \pi_j = 1$, where we assume that
$\pi_j > 0$ for each $j$. A randomized algorithm can be represented as a
nonnegative weight vector $w = [w_j]$ such that $\sum_j \pi_j w_j = 1$.

MDL gives a deterministic estimator, which corresponds to the set of
weights concentrated on any one specific point $k$. That is, we can select
$S$ in (1), where each weight $w$ in $S$ corresponds to an index $k \in \Gamma$ such
that $w_k = 1/\pi_k$ and $w_j = 0$ when $j \ne k$. It is easy to check that
$D_{KL}(w\,d\pi \| d\pi) = \ln(1/\pi_k)$. The corresponding algorithm can thus be
described as finding a probability density $p_{\hat k}$ with $\hat k$ obtained by

(6)    $$\hat k = \arg\min_k \left[ \sum_{i=1}^n \ln \frac{1}{p_k(X_i)} + \lambda \ln \frac{1}{\pi_k} \right],$$

where $\lambda \ge 1$ is a regularization parameter. The first term corresponds
to the description of the data, and the second term corresponds to the
description of the model. The choice $\lambda = 1$ can be interpreted as
minimizing the total description length, which corresponds to the
standard MDL. The choice $\lambda > 1$ corresponds to heavier penalty on the
model description, which makes the estimation method more stable. This
modified MDL method was considered in [1] and the authors obtained
results on the asymptotic rate of convergence. However, no simple
finite-sample bound was obtained. For the case of $\lambda = 1$, only weak
consistency was shown. In the following, we shall improve these results
using the analysis presented in Section 3.
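The selection rule (6) is straightforward to implement for a finite list of candidate densities. The following is a minimal sketch, assuming a small family of Bernoulli models and a toy sample; the function `two_part_mdl`, the model list, and the priors are hypothetical choices made for illustration, not taken from [1].

```python
import math

def two_part_mdl(data, models, prior, lam=1.0):
    """Return (k, length) minimizing sum_i ln(1/p_k(X_i)) + lam * ln(1/pi_k), as in (6).

    `models` is a list of functions x -> p_k(x); `prior` holds the weights pi_k > 0.
    """
    best_k, best_len = None, math.inf
    for k, (p_k, pi_k) in enumerate(zip(models, prior)):
        data_len = -sum(math.log(p_k(x)) for x in data)   # description of the data
        model_len = -lam * math.log(pi_k)                  # description of the model
        total = data_len + model_len
        if total < best_len:
            best_k, best_len = k, total
    return best_k, best_len

def bernoulli(theta):
    """Density of a Bernoulli(theta) observation x in {0, 1}."""
    return lambda x: theta if x == 1 else 1.0 - theta

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]          # illustrative model family
models = [bernoulli(t) for t in thetas]
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]       # toy sample: 8 ones, 2 zeros

uniform = [0.2] * 5
favoring_last = [0.01, 0.01, 0.01, 0.01, 0.96]   # prior mass concentrated on theta = 0.9

k_uniform, len_uniform = two_part_mdl(data, models, uniform, lam=1.0)
k_favoring, len_favoring = two_part_mdl(data, models, favoring_last, lam=1.0)
print(thetas[k_uniform], thetas[k_favoring])
```

With a uniform prior the model description term is constant, so the rule reduces to maximum likelihood over the list; a nonuniform prior can move the selected index toward models with larger prior mass.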

4.1. Modified MDL under global entropy condition. Consider the case
$\lambda > 1$ in (6). We can obtain the following theorem from Corollary 3.2.

Theorem 4.1. Consider the estimator $\hat k$ defined in (6). Assume that
$\lambda > 1$. Then for all $\rho \in (0, 1/\lambda]$,

$$E_X D_\rho(q \| p_{\hat k}) \le E_X D^{Re}_\rho(q \| p_{\hat k}) \le \frac{1}{\rho(\lambda - 1)} \inf_k \left[ D_{KL}(q \| p_k) + \frac{\lambda}{n} \ln \frac{1}{\pi_k} \right].$$

The term $r_{\lambda,n}(q) = \inf_k \left[ D_{KL}(q \| p_k) + \frac{\lambda}{n} \ln \frac{1}{\pi_k} \right]$ is
referred to as index of resolvability in [1]. They showed (Theorem 4)
that $D_{1/2}(q \| p_{\hat k}) = O_p(r_{\lambda,n}(q))$ when $\lambda > 1$, which is a direct
consequence of Theorem 4.1.

Theorem 4.1 generalizes a result by Andrew Barron and Jonathan Li,
which gave a similar inequality but only for the case of $\lambda = 2$ and
$\rho = 1/2$. The result can be found in [7], Theorem 5.5, page 78. In
particular, consider $\Gamma$ such that $|\Gamma| = N$ with uniform prior
$\pi_j = 1/N$; one obtains a bound for the maximum likelihood estimate over
$\Gamma$ (take $\lambda = 2$ and $\rho = 1/2$ in Theorem 4.1),

(7)    $$E_X D_{1/2}(q \| p_{\hat k}) \le 2 \inf_k \left[ D_{KL}(q \| p_k) + \frac{2}{n} \ln N \right].$$

Examples of indexes of resolvability for various function classes can be
found in [1], which we shall not repeat in this paper. In particular, it
is known that for nonparametric problems, with appropriate
discretization the rate resulting from (7) matches the minimax rate,
such as those in [16].

4.2. Local entropy analysis. Although the bound based on the index of
resolvability in Theorem 4.1 is quite useful for nonparametric problems
(see [1]), it does not handle the parametric case satisfactorily. To see
this, we consider a one-dimensional parameter family indexed by
$\theta \in [0, 1]$, and we discretize the family using a uniform discrete net
of size $N + 1$, $\theta_j = j/N$ ($j = 0, \ldots, N$). In the following, we assume
that $q$ is taken from the parametric family, and for some fixed $\rho$, both
$D^{Re}_\rho(q \| p_k)$ and $D_{KL}(q \| p_k)$ are of the order $(\theta - \theta_k)^2$. That
is, we assume that there exist constants $c_1$ and $c_2$ where

(8)    $$c_1 (\theta - \theta_k)^2 \le D^{Re}_\rho(q \| p_k), \qquad D_{KL}(q \| p_k) \le c_2 (\theta - \theta_k)^2.$$

We will thus have $\inf_k D_{KL}(q \| p_k) \le c_2 N^{-2}$, and the bound in (7),
which relies on the index of resolvability, becomes
$E_X D_{1/2}(q \| p_{\hat k}) \le O(N^{-2}) + \frac{4}{n} \ln(N + 1)$. Now by choosing
$N = O(n^{1/2})$, we obtain a suboptimal convergence rate
$E_X D_{1/2}(q \| p_{\hat k}) \le O(\ln n / n)$. Note that convergence rates
established in [1] for parametric examples are also of the order
$O(\ln n / n)$.

The main reason for this suboptimality is that the complexity measure
$O(\ln N)$ or $O(-\ln \pi_k)$ corresponds to the globally defined entropy.
However, readers who are familiar with the empirical process theory know
that the rate of convergence of the maximum-likelihood estimate is
determined by local entropy mentioned in [5]. For nonparametric
problems, it was pointed out in [16] that the worst-case local entropy
is of the same order as the global entropy. Therefore a theoretical
analysis which relies on global entropy (such as Theorem 4.1) leads to
the correct worst-case rate at least in the minimax sense. For
parametric problems, at the $O(1/n)$ approximation level, local entropy
is constant but the global entropy is $\ln n$. This leads to a $\ln(n)$
difference in the resulting bound.

Although it may not be immediately obvious how to define a localized
counterpart of the index of resolvability, we can introduce a correction term
which has the same effect. As pointed out earlier, this is essentially the role
of the $c_{\rho,n}(\alpha)$ term in Theorem 3.1. We include a simplified version below,
which can be obtained by choosing $\alpha = 1/2$ and $\gamma = \rho = 1/\lambda$.

Theorem 4.2. Consider the estimator $\hat k$ defined in (6). Assume that
$\lambda > 1$, and let $\rho = 1/\lambda$. Then



BxDf%q\\pj^)<-^mi 

^ 1 — p k 



DKh{q\\Pk) + - In 



The bound relies on a localized version of the index of resolvability,
with the global entropy $-\ln \pi_k$ replaced by a localized entropy
$\ln \sum_j \pi_j e^{-\rho(1-\rho)(1-\alpha) n D^{\mathrm{Re}}_\rho(q\|p_j)} - \ln \pi_k$. Since

$$\sum_j \pi_j e^{-\rho(1-\rho)(1-\alpha) n D^{\mathrm{Re}}_\rho(q\|p_j)} \le \sum_j \pi_j = 1,$$

the localized entropy is never larger than the global entropy. Intuitively,
we can see that if $p_j(x)$ is far away from $q(x)$, then
$\exp(-\rho(1-\rho)(1-\alpha) n D^{\mathrm{Re}}_\rho(q\|p_j))$ is exponentially small as $n \to \infty$.
It follows that the main contribution to the summation
$\sum_j \pi_j e^{-\rho(1-\rho)(1-\alpha) n D^{\mathrm{Re}}_\rho(q\|p_j)}$ is from terms
such that $D^{\mathrm{Re}}_\rho(q\|p_j)$ is small. This is equivalent to a reweighting of the prior
$\pi_k$ in such a way that we only count points that are localized within a small
$D^{\mathrm{Re}}_\rho$-ball of $q$.

This localization leads to the correct rate of convergence for parametric
problems. The effect is similar to using localized entropy in the empirical
process analysis. We still consider the same one-dimensional problem discussed
at the beginning of the section, with a uniform discretization consisting of
$N+1$ points. We will consider the maximum-likelihood estimate. For
one-dimensional parametric problems, using the assumption in (8), we have for
all $N^2 = O(n)$,

$$\sum_j e^{-\rho(1-\rho)(1-\alpha) n D^{\mathrm{Re}}_\rho(q\|p_j)} \le \sum_j e^{-\rho(1-\rho)(1-\alpha) n c_1 j^2/N^2} = O(1).$$

Since $\pi_j = 1/(N+1)$, the localized entropy

$$\ln \frac{\sum_j \pi_j e^{-\rho(1-\rho)(1-\alpha) n D^{\mathrm{Re}}_\rho(q\|p_j)}}{\pi_k} = O(1)$$

is a constant when $N = O(n^{1/2})$. Therefore with a discretization size
$N = O(n^{1/2})$, Theorem 4.2 implies a convergence rate of the correct order $O(1/n)$.
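
As a numerical illustration of why the localized entropy stays bounded, the sketch below evaluates the sum $\sum_j e^{-c\,n\,j^2/N^2}$ for $N = n^{1/2}$ and compares its logarithm with the global entropy $\ln(N+1)$. The constant `c` stands in for $\rho(1-\rho)(1-\alpha)c_1$ and is a hypothetical value, and the truth is assumed to sit at the grid point $\theta = 0$ for simplicity.

```python
import math

def localized_sum(n, N, c=0.1):
    """sum_{j=0}^{N} exp(-c * n * (j / N)^2), with the truth placed at theta = 0.

    c is a hypothetical stand-in for rho * (1 - rho) * (1 - alpha) * c_1.
    """
    return sum(math.exp(-c * n * (j / N) ** 2) for j in range(N + 1))

for n in (10**2, 10**4, 10**6):
    N = math.isqrt(n)  # net size of order n^(1/2)
    # Global entropy ln(N + 1) grows with n; the localized entropy does not.
    print(n, N, math.log(N + 1), math.log(localized_sum(n, N)))
```

With a uniform prior the factor $1/(N+1)$ cancels between $\pi_j$ and $\pi_k$, so the logarithm of this sum is exactly the localized entropy in this toy setting.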

4.3. The standard MDL ($\lambda = 1$). The standard MDL with $\lambda = 1$ in (6)
is more complicated to analyze. It is impossible to give a bound similar to
Theorem 4.1 that depends only on the index of resolvability. As a matter of
fact, no bound was established in [1]. As we will show later, the method can
converge very slowly even if the index of resolvability is well behaved.

However, it is possible to obtain bounds in this case under additional
assumptions on the rate of decay of the prior $\pi$. The following theorem is a
straightforward interpretation of Corollary 3.3, where we consider the family
itself as a 0-upper discretization, $\Gamma_j = \{p_j\}$.

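Theorem 4.3. Consider the estimator $\hat k$ defined in (6) with $\lambda = 1$. For
all $\rho \in (0,1)$ and $\gamma \ge 1$, we have

$$E_X D^{\mathrm{Re}}_\rho(q\|p_{\hat k}) \le \frac{\gamma \inf_k [D_{\mathrm{KL}}(q\|p_k) + (1/n)\ln(1/\pi_k)]}{\rho(1-\rho)} + \frac{\gamma-\rho}{\rho(1-\rho)n} \ln \sum_j \pi_j^{(\gamma-1)/(\gamma-\rho)}.$$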

The above theorem depends only on the index of resolvability and the
decay of the prior $\pi$. If $\pi$ has a fast decay in the sense that
$\sum_j \pi_j^{(\gamma-1)/(\gamma-\rho)} < +\infty$ and does not change with respect to $n$, then the
second term on the right-hand side of Theorem 4.3 is $O(1/n)$. In this case the
convergence rate is determined by the index of resolvability. The prior decay
condition specified here is rather mild. This implies that the standard MDL is
usually Hellinger consistent when used with care.



4.4. Slow convergence of the standard MDL. The purpose of this section
is to illustrate that the index of resolvability cannot by itself determine the
rate of convergence for the standard MDL. We consider a simple example
related to the Bayesian inconsistency counterexample given in [2], with an
additional randomization argument. Note that due to the randomization, we
shall allow two densities in our model class to be identical. It is clear from
the construction that this requirement is for convenience only, rather than
anything essential.

Given a sample size $n$, consider an integer $m$ such that $m \gg n$. Let the
space $\mathcal{X}$ consist of $2m$ points $\{1, \ldots, 2m\}$. Assume that the truth $q$ is the
uniform distribution, $q(u) = 1/(2m)$ for $u = 1, \ldots, 2m$.

Consider a density class $\Gamma'$ consisting of all densities $p$ such that either
$p(u) = 0$ or $p(u) = 1/m$. That is, a density $p$ in $\Gamma'$ takes the value $1/m$ at $m$
of the $2m$ points, and $0$ elsewhere. Now let our model class $\Gamma$ consist of the
true density $q$ with prior $1/4$, as well as $2^n$ densities $p_j$ ($j = 1, \ldots, 2^n$) that
are randomly and uniformly drawn from $\Gamma'$ (with replacement), where each
$p_j$ is given the same prior $3/2^{n+2}$.

We shall show that for a sufficiently large integer $m$, we will estimate one
of the $2^n$ densities from $\Gamma'$ with probability at least $1 - e^{-1/2}$. Since the
index of resolvability is $\ln 4/n$, which is small when $n$ is large, the example
implies that the convergence of the standard MDL method cannot be
characterized by the index of resolvability alone.

Let $X = \{X_1, \ldots, X_n\}$ be a set of $n$ samples from $q$ and let $\hat p$ be the
estimator from (6) with $\lambda = 1$ and $\Gamma$ randomly generated above. We would
like to estimate $P(\hat p = q)$. By construction, $\hat p = q$ only when
$\prod_{i=1}^n p_j(X_i) = 0$ for all $p_j \in \Gamma' \cap \Gamma$. Now pick $m$ large enough such that
$(m-n)^n/m^n \ge 0.5$; we have

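
$$\begin{aligned}
P(\hat p = q) &= P\Big(\forall p_j \in \Gamma' \cap \Gamma : \prod_{i=1}^n p_j(X_i) = 0\Big) \\
&= E_X P\Big(\forall p_j \in \Gamma' \cap \Gamma : \prod_{i=1}^n p_j(X_i) = 0 \,\Big|\, X\Big) \\
&= E_X P\Big(\prod_{i=1}^n p_1(X_i) = 0 \,\Big|\, X\Big)^{2^n} \\
&= E_X \Big(1 - \binom{2m-|X|}{m-|X|} \Big/ \binom{2m}{m}\Big)^{2^n} \\
&\le E_X \Big(1 - \Big(\frac{m-n}{2m}\Big)^n\Big)^{2^n} \\
&\le (1 - 2^{-(n+1)})^{2^n} \le e^{-0.5},
\end{aligned}$$

where $|X|$ denotes the number of distinct elements in $X$. Therefore with a
constant probability we have $\hat p \ne q$ no matter how large $n$ is.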
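
The probability bound in this example can be evaluated directly. The sketch below is an illustration under stated assumptions: the values of `n` and `m` are hypothetical, and the helper names are introduced here only for this check. It computes the exact probability that none of the $2^n$ random densities has all $k$ distinct sample points in its support, takes the worst case over $k \le n$, and compares it with $e^{-0.5}$.

```python
import math

def prob_cover(m, k):
    """P(a uniformly random m-subset of {1, ..., 2m} contains k given distinct points)."""
    h = 1.0
    for i in range(k):
        h *= (m - i) / (2 * m - i)
    return h

def prob_select_truth(n, m, k):
    """P(none of the 2^n independently drawn densities covers all k distinct sample points)."""
    return (1.0 - prob_cover(m, k)) ** (2 ** n)

n, m = 10, 10_000  # hypothetical sizes with m much larger than n
# The sample has between 1 and n distinct points; take the worst case.
worst = max(prob_select_truth(n, m, k) for k in range(1, n + 1))
print(worst, math.exp(-0.5))
```

Any density whose support covers the sample has a strictly smaller MDL score than $q$, because its prior penalty $\ln(2^{n+2}/3)$ is outweighed by the likelihood gain $n \ln 2$; this is why "no covering density" is the only way for the truth to be selected.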

This example shows that it is impossible to obtain any rate of convergence
result using the index of resolvability alone. In order to estimate convergence,
it is thus necessary to make additional assumptions, such as the prior decay
condition of Theorem 4.3. The randomization used in the construction is
not essential. This is because there exists at least one draw (a deterministic
configuration) whose convergence probability (the probability of correct
estimation) is no larger than the expected convergence probability under
randomization, which is at most $e^{-0.5}$.

We shall also mention that starting from this example, together with a
construction scheme similar to that of the Bayesian inconsistency
counterexample in [2], it is not difficult to show that the standard MDL is not
Hellinger consistent even when the index of resolvability approaches zero as
$n \to \infty$. For simplicity, we skip the detailed construction in this paper.



4.5. Weak convergence of the standard MDL. Although Hellinger
consistency cannot be obtained for standard MDL based on the index of
resolvability alone, it was shown in [1] that as $n \to \infty$, if the index of
resolvability approaches zero, then $p_{\hat k}$ converges weakly to $q$ in probability
(in the sense discussed at the end of Section 3.3). This result is a direct
consequence of Theorem 3.2, which we shall restate here.

Theorem 4.4. Consider the estimator $\hat k$ defined in (6) with $\lambda = 1$. Then
$\forall f : \mathcal{X} \to [-1, 1]$, we have

1 " 

Ex E,,/(x)--^/(X,) 

it. 

where An = inffe [Z^klI^I + ^ In ^] + ^ . 



< 2An + ^j2Ar, 



Note that in the sense discussed at the end of Section 3.3, this theorem
essentially implies that the standard MDL estimator is weakly consistent
(in probability) as long as the index of resolvability approaches zero when
$n \to \infty$. Moreover, it establishes a rate of convergence result which depends
only on the index of resolvability. This theorem improves the consistency
result in [1], where no rate of convergence result was established and $f$ was
assumed to be an indicator function.

5. Bayesian posterior distributions. Assume we observe $n$ samples
$X = \{X_1, \ldots, X_n\} \in \mathcal{X}^n$, independently drawn from the true underlying
distribution $Q$ with density $q$. As mentioned earlier, we call any probability
density $\hat w_X(\theta)$ with respect to $\pi$ that depends on the observation $X$ (and is
measurable on $\mathcal{X}^n \times \Gamma$) a posterior. For all $\gamma > 0$, we define a generalized
Bayesian posterior $\pi_\gamma(\cdot|X)$ with respect to $\pi$ as (also see [15])

(9) $\pi_\gamma(\theta|X) = \dfrac{\prod_{i=1}^n p(X_i|\theta)^\gamma}{\int_\Gamma \prod_{i=1}^n p(X_i|\theta)^\gamma \, d\pi(\theta)}$.

We call $\pi_\gamma$ the $\gamma$-Bayesian posterior. The standard Bayesian posterior is
denoted as $\pi(\cdot|X) = \pi_1(\cdot|X)$.

The key starting point of our analysis is the following simple observation
that relates the Bayesian posterior to an instance of information complexity
minimization which we have already analyzed in this paper.

Proposition 5.1. Consider a prior $\pi$ and $\lambda > 0$. Then

$$R_\lambda(\pi_{1/\lambda}(\cdot|X)) = -\frac{\lambda}{n} \ln E_\pi \exp\Big(\frac{1}{\lambda}\sum_{i=1}^n \ln \frac{p(X_i|\theta)}{q(X_i)}\Big) = \inf_w R_\lambda(w),$$

where $R_\lambda(w)$ is defined in (2), and the inf on the right-hand side is over all
possible densities $w$ with respect to the prior $\pi$.

Proof. The first equality follows from simple algebra.
Now let $f(\theta) = \frac{1}{\lambda}\sum_{i=1}^n \ln \frac{p(X_i|\theta)}{q(X_i)}$ in Proposition 2.1; we obtain

$$-\frac{\lambda}{n} \ln E_\pi \exp(f(\theta)) \le \inf_w R_\lambda(w) \le R_\lambda(\pi_{1/\lambda}(\cdot|X)).$$

Combining this with the first equality, we know that equality holds in the
above chain of inequalities. This proves the second equality. □

The above proposition indicates that the generalized Bayesian posterior
can be regarded as a minimum information complexity estimator (1) with
$S$ consisting of all possible densities. Therefore results parallel to those of
MDL can be obtained.



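5.1. Generalized Bayesian methods. Similarly to the index of resolvability
complexity measure for MDL, for Bayesian-like methods the corresponding
model resolvability, which controls the complexity, becomes the Bayesian
resolvability defined as

(10) $r_{\lambda,n}(q) = \inf_w \Big[E_\pi w(\theta) D_{\mathrm{KL}}(q\|p(\cdot|\theta)) + \dfrac{\lambda}{n} D_{\mathrm{KL}}(w\,d\pi\|d\pi)\Big] = -\dfrac{\lambda}{n} \ln E_\pi e^{-(n/\lambda) D_{\mathrm{KL}}(q\|p(\cdot|\theta))}$.

The density that attains the infimum of (10) is given by

$$w(\theta) \propto \exp\Big(-\frac{n}{\lambda} D_{\mathrm{KL}}(q\|p(\cdot|\theta))\Big).$$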
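
The variational identity behind the Bayesian resolvability can be verified numerically for a finite prior. The following is a minimal sketch under stated assumptions: the prior weights, the KL values, and the choices of `n` and `lam` are hypothetical numbers, and the function names are introduced only for this illustration. It compares the objective evaluated at the exponentially tilted density with the closed-form log-moment expression.

```python
import math

def objective(w, prior, kl, n, lam):
    """E_pi[w * KL] + (lam / n) * KL(w dpi || dpi), where w is a density w.r.t. the prior."""
    fit = sum(p * wi * d for p, wi, d in zip(prior, w, kl))
    ent = sum(p * wi * math.log(wi) for p, wi in zip(prior, w) if wi > 0)
    return fit + lam / n * ent

def closed_form(prior, kl, n, lam):
    """-(lam / n) * ln E_pi exp(-(n / lam) * KL)."""
    return -lam / n * math.log(sum(p * math.exp(-n / lam * d) for p, d in zip(prior, kl)))

def gibbs_density(prior, kl, n, lam):
    """Density proportional to exp(-(n / lam) * KL), normalized so that E_pi[w] = 1."""
    u = [math.exp(-n / lam * d) for d in kl]
    z = sum(p * ui for p, ui in zip(prior, u))
    return [ui / z for ui in u]

# Hypothetical three-point model: prior weights and KL divergences from the truth.
prior = [0.5, 0.3, 0.2]
kl = [0.02, 0.3, 1.0]
n, lam = 50, 2.0

w_star = gibbs_density(prior, kl, n, lam)
print(objective(w_star, prior, kl, n, lam), closed_form(prior, kl, n, lam))
```

Any other density with respect to the prior gives an objective value at least as large, which is the content of the infimum in the definition.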

The following proposition gives a simple and intuitive estimate of the
Bayesian index of resolvability. This bound implies that the Bayesian
resolvability can be estimated using local properties of the prior $\pi$ around the
true density $q$. The quantity is small as long as there is a positive prior mass
in a small KL-ball around the truth $q$.

Proposition 5.2. The Bayesian resolvability defined in (10) can be
bounded as

$$r_{\lambda,n}(q) \le \inf_{\varepsilon > 0} \Big[\varepsilon - \frac{\lambda}{n} \ln \pi(\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon\})\Big].$$

Proof. For all $\varepsilon > 0$, we simply note that
$E_\pi e^{-(n/\lambda) D_{\mathrm{KL}}(q\|p(\cdot|\theta))} \ge e^{-(n/\lambda)\varepsilon} \times \pi(\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon\})$.
Now taking the logarithm and using (10), we obtain the desired inequality. □
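
The prior-mass bound of Proposition 5.2 is easy to check on a finite model. The sketch below is illustrative only: the prior weights and KL values are hypothetical, and for a finite prior the infimum over $\varepsilon$ is evaluated at the distinct KL values, where the prior mass of the KL-ball jumps.

```python
import math

def resolvability(prior, kl, n, lam):
    """Closed form -(lam / n) * ln E_pi exp(-(n / lam) * KL) of the Bayesian resolvability."""
    return -lam / n * math.log(sum(p * math.exp(-n / lam * d) for p, d in zip(prior, kl)))

def prior_mass_bound(prior, kl, n, lam):
    """min over eps of eps - (lam / n) * ln pi({KL <= eps}), with eps ranging over KL values."""
    best = math.inf
    for eps in sorted(set(kl)):
        mass = sum(p for p, d in zip(prior, kl) if d <= eps)
        best = min(best, eps - lam / n * math.log(mass))
    return best

# Hypothetical finite model; all KL values are strictly positive.
prior = [0.1, 0.2, 0.3, 0.4]
kl = [0.01, 0.05, 0.4, 2.0]
for n in (10, 100, 1000):
    print(n, resolvability(prior, kl, n, 2.0), prior_mass_bound(prior, kl, n, 2.0))
```

As `n` grows, both quantities are driven by the prior mass of a small KL-ball around the truth, in line with the remark preceding the proposition.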



The following bound is a direct consequence of Corollary 3.2.

Theorem 5.1. Consider the generalized Bayesian posterior $\pi_{1/\lambda}(\theta|X)$
defined in (9) with $\lambda > 1$. Then $\forall \rho \in (0, 1/\lambda]$,

$$E_X E_\pi \pi_{1/\lambda}(\theta|X) D^{\mathrm{Re}}_\rho(q\|p(\cdot|\theta)) \le -\frac{\lambda}{\rho(\lambda-1)n} \ln E_\pi \exp\Big(-\frac{n}{\lambda} D_{\mathrm{KL}}(q\|p(\cdot|\theta))\Big).$$

The above theorem gives a general convergence bound on the $\gamma$-Bayesian
method with $\gamma < 1$, depending only on the globally defined Bayesian
resolvability. Note that similarly to Theorem 4.2 for the MDL case, a bound using
a localized Bayesian resolvability can also be obtained.

Theorem 5.1 immediately implies the concentration of a generalized Bayesian
posterior. Define the posterior mass outside an $\varepsilon$ $D^{\mathrm{Re}}_\rho$-ball around $q$ as

$$\pi_{1/\lambda}(\{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) \ge \varepsilon\} \,|\, X).$$

Using the bound in Theorem 5.1 and Proposition 5.2, we can show that
with large probability, the generalized Bayesian posterior mass outside a
$D^{\mathrm{Re}}_\rho$-ball of size $O(\varepsilon)$ is exponentially small when $\varepsilon \gg \varepsilon_{\pi,n}$. However, the average
performance bound in Theorem 5.1 is not refined enough to yield exponential
tail probability directly under the prior $\pi$. In order to obtain the correct
behavior, we shall thus consider a prior $\pi'$ related to $\pi$ which is more heavily
concentrated on distributions that are far away from $q$. We choose $\pi'$ for
which Theorem 5.1 can be used to obtain a constant probability of posterior
concentration. We then translate the concentration of the posterior with respect
to $\pi'$ to a concentration result with respect to $\pi$.

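Corollary 5.1. Let $\lambda > 1$ and $\rho \in (0, 1/\lambda]$. Then for all $t \ge 0$ and
$\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\pi_{1/\lambda}\Big(\Big\{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) \ge \frac{2(2\varepsilon_{\pi,n} + t)}{\rho(\lambda-1)\delta}\Big\} \,\Big|\, X\Big) \le \frac{1}{1 + e^{nt/\lambda}},$$

where the critical prior-mass radius
$\varepsilon_{\pi,n} = \inf\{\varepsilon : \varepsilon \ge -\frac{\lambda}{n} \ln \pi(\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon\})\}$.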

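Proof. Let $\varepsilon_t = 2(2\varepsilon_{\pi,n} + t)/(\rho(\lambda-1)\delta)$, $\Gamma_1 = \{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) < \varepsilon_t\}$
and $\Gamma_2 = \{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) \ge \varepsilon_t\}$. We let $a = e^{-nt/\lambda}$ and define
$\pi'(\theta) = a\pi(\theta)C$ when $\theta \in \Gamma_1$ and $\pi'(\theta) = \pi(\theta)C$ when $\theta \in \Gamma_2$, where the
normalization constant $C = (a\pi(\Gamma_1) + \pi(\Gamma_2))^{-1} \in [1, 1/a]$.

Now apply Theorem 5.1 and Proposition 5.2 with the prior $\pi'$. We obtain
(using the Markov inequality)

$$\begin{aligned}
E_X \pi'_{1/\lambda}(\Gamma_2|X)\,\varepsilon_t &\le E_X E_{\pi'} \pi'_{1/\lambda}(\theta|X) D^{\mathrm{Re}}_\rho(q\|p(\cdot|\theta)) \\
&\le -\frac{\lambda}{\rho(\lambda-1)n} \ln E_{\pi'} \exp\Big(-\frac{n}{\lambda} D_{\mathrm{KL}}(q\|p(\cdot|\theta))\Big) \\
&\le -\frac{\lambda}{\rho(\lambda-1)n} \Big[\ln a + \ln E_\pi \exp\Big(-\frac{n}{\lambda} D_{\mathrm{KL}}(q\|p(\cdot|\theta))\Big)\Big] \\
&\le \frac{1}{\rho(\lambda-1)} \Big[t + \varepsilon_{\pi,n} - \frac{\lambda}{n} \ln \pi(\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon_{\pi,n}\})\Big] \\
&\le \frac{1}{\rho(\lambda-1)} [t + 2\varepsilon_{\pi,n}].
\end{aligned}$$

In the above derivation, the first inequality is the Markov inequality; the
second inequality is from Theorem 5.1; the third inequality follows from
$\pi' \ge a\pi$; the fourth inequality follows from Proposition 5.2; the final inequality
uses the definition of $\varepsilon_{\pi,n}$.

Now we can divide both sides by $\varepsilon_t$, and obtain with probability $1 - \delta$
that $\pi'_{1/\lambda}(\Gamma_2|X) \le 0.5$. By construction,
$\pi'_{1/\lambda}(\Gamma_2|X) = \pi_{1/\lambda}(\Gamma_2|X)/(a(1 - \pi_{1/\lambda}(\Gamma_2|X)) + \pi_{1/\lambda}(\Gamma_2|X))$.
We can solve for $\pi_{1/\lambda}(\Gamma_2|X)$ as
$\pi_{1/\lambda}(\Gamma_2|X) = a\pi'_{1/\lambda}(\Gamma_2|X)/(1 - (1-a)\pi'_{1/\lambda}(\Gamma_2|X)) \le a/(1+a)$. □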
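
The last step of the proof is a change of variables between the posterior mass of the far region under the original prior and under the reweighted prior. The sketch below, with hypothetical function names and test values, implements both directions of that map so the final inequality can be checked numerically.

```python
def mass_under_tilted_prior(x, a):
    """Posterior mass of the far region under the reweighted prior.

    x is the posterior mass of the far region under the original prior, and
    a in (0, 1] is the factor by which the prior on the near region is scaled.
    """
    return x / (a * (1.0 - x) + x)

def mass_under_original_prior(x_tilted, a):
    """Inverse map: recover the original posterior mass from the reweighted one."""
    return a * x_tilted / (1.0 - (1.0 - a) * x_tilted)

a = 1e-3  # hypothetical value of exp(-n t / lambda)
for x_tilted in (0.1, 0.3, 0.5):
    # Whenever the reweighted mass is at most 1/2, the original mass is at most a / (1 + a).
    print(x_tilted, mass_under_original_prior(x_tilted, a), a / (1.0 + a))
```

The inverse map is increasing in its first argument, so the bound at the value one half is the worst case.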

From the bound, we can see that with large probability the posterior
probability outside a $D^{\mathrm{Re}}_\rho$-ball with large distance $t$ decays exponentially in
$nt$ and is independent of the complexity of the prior (as long as $t$ is larger
than the scale of the critical radius $\varepsilon_{\pi,n}$). As we will see later, the same is
true for the standard Bayesian posterior distributions.

5.2. The standard Bayesian method. For the standard Bayesian poste- 
rior distribution, it is impossible to bound its convergence using only the 
Bayesian resolvability. The reason is the same as in the MDL case. In fact, 
it is immediately obvious that the example for MDL can also be applied 
here. Also see [2] for a related example. 

Therefore in order to obtain a rate of convergence (and concentration) for 
the standard Bayesian method, additional assumptions are necessary. Sim- 
ilarly to Theorem 4.3, bounds using upper-bracketing radius can be easily 
obtained from Corollary 3.3. 

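Theorem 5.2. Consider the Bayesian posterior $\pi(\cdot|X) = \pi_1(\cdot|X)$
defined in (9). Consider an arbitrary cover $\{\Gamma_j\}$ of $\Gamma$. Then $\forall \rho \in (0,1)$ and
$\gamma \ge 1$, we have

$$E_X E_\pi \pi(\theta|X) D^{\mathrm{Re}}_\rho(q\|p(\cdot|\theta)) \le \frac{\gamma \ln E_\pi e^{-n D_{\mathrm{KL}}(q\|p(\cdot|\theta))}}{\rho(\rho-1)n} + \frac{\gamma-\rho}{\rho(1-\rho)n} \ln \sum_j \pi(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)} (1 + r_{\mathrm{ub}}(\Gamma_j))^n.$$

For all $\varepsilon > 0$, consider an $\varepsilon$-upper discretization $\{\Gamma^\varepsilon_j\}$ of $\Gamma$. We obtain
from Theorem 5.2,

$$E_X E_\pi \pi(\theta|X) D^{\mathrm{Re}}_\rho(q\|p(\cdot|\theta)) \le \frac{\gamma \ln E_\pi e^{-n D_{\mathrm{KL}}(q\|p(\cdot|\theta))}}{\rho(\rho-1)n} + \frac{\gamma-\rho}{\rho(1-\rho)} \Big[\frac{1}{n} \ln \sum_j \pi(\Gamma^\varepsilon_j)^{(\gamma-1)/(\gamma-\rho)} + \ln(1+\varepsilon)\Big].$$

In particular, let $\gamma \to 1$. We have

$$E_X E_\pi \pi(\theta|X) D^{\mathrm{Re}}_\rho(q\|p(\cdot|\theta)) \le \frac{\ln E_\pi e^{-n D_{\mathrm{KL}}(q\|p(\cdot|\theta))}}{\rho(\rho-1)n} + \frac{1}{\rho} \Big[\frac{1}{n} \ln N_{\mathrm{ub}}(\Gamma, \varepsilon) + \ln(1+\varepsilon)\Big],$$

where $N_{\mathrm{ub}}(\Gamma, \varepsilon)$ is the $\varepsilon$-upper-bracketing covering number of $\Gamma$.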

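Similarly to Corollary 5.1, we obtain the following concentration result
for the standard Bayesian posterior distribution from Theorem 5.2.

Corollary 5.2. Let $\varepsilon_{\pi,n} = \inf\{\varepsilon : \varepsilon \ge -\frac{1}{n} \ln \pi(\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon\})\}$
be the critical prior-mass radius. Let $\rho \in (0, 1)$. For all $s \in [0, 1]$, let

$$\varepsilon_{\mathrm{upper},n}(s) = \frac{1}{n} \inf_{\{\Gamma_j\}} \ln \sum_j \pi(\Gamma_j)^s (1 + r_{\mathrm{ub}}(\Gamma_j))^n$$

be the critical upper-bracketing radius with coefficient $s$, where $\{\Gamma_j\}$ denotes
an arbitrary covering of $\Gamma$. Now $\forall \rho \in (0, 1)$ and $\gamma \ge 1$, let

$$\varepsilon_n = 2\gamma\varepsilon_{\pi,n} + (\gamma - \rho)\varepsilon_{\mathrm{upper},n}((\gamma-1)/(\gamma-\rho)).$$

We have for all $t \ge 0$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\pi\Big(\Big\{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) \ge \frac{2\varepsilon_n + (4\gamma-2)t}{(\rho-\rho^2)\delta}\Big\} \,\Big|\, X\Big) \le \frac{1}{1 + e^{nt}}.$$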
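Proof. The proof is similar to that of Corollary 5.1. We let
$\varepsilon_t = (2\varepsilon_n + (4\gamma-2)t)/((\rho-\rho^2)\delta)$. Define $\Gamma_1 = \{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) < \varepsilon_t\}$ and
$\Gamma_2 = \{p \in \Gamma : D^{\mathrm{Re}}_\rho(q\|p) \ge \varepsilon_t\}$. We let $a = e^{-nt}$ and define $\pi'(\theta) = a\pi(\theta)C$
when $\theta \in \Gamma_1$ and $\pi'(\theta) = \pi(\theta)C$ when $\theta \in \Gamma_2$, where the normalization
constant $C = (a\pi(\Gamma_1) + \pi(\Gamma_2))^{-1} \in [1, 1/a]$.

Using Proposition 5.2 and the assumption of the theorem, we obtain

$$\begin{aligned}
&\frac{\gamma \ln E_{\pi'} e^{-n D_{\mathrm{KL}}(q\|p(\cdot|\theta))}}{\rho(\rho-1)n} + \frac{\gamma-\rho}{\rho(1-\rho)n} \inf_{\{\Gamma_j\}} \ln \sum_j \pi'(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)} (1 + r_{\mathrm{ub}}(\Gamma_j))^n \\
&\quad\le \frac{\gamma t - (\gamma/n) \ln E_\pi e^{-n D_{\mathrm{KL}}(q\|p(\cdot|\theta))}}{\rho(1-\rho)} + \frac{\gamma-\rho}{\rho(1-\rho)} \Big[\frac{(\gamma-1)t}{\gamma-\rho} + \varepsilon_{\mathrm{upper},n}\Big(\frac{\gamma-1}{\gamma-\rho}\Big)\Big] \\
&\quad\le \frac{(2\gamma-1)t + \varepsilon_n}{\rho(1-\rho)}.
\end{aligned}$$

In the first inequality, we have used the fact that $a\pi(\theta) \le \pi'(\theta) \le \pi(\theta)/a$.
Similarly to the proof of Corollary 5.1, we can use the Markov inequality to
obtain $\pi'(\Gamma_2|X) \le 0.5$ with probability $1 - \delta$. This leads to the desired bound
for $\pi(\Gamma_2|X) = a\pi'(\Gamma_2|X)/(1 - (1-a)\pi'(\Gamma_2|X))$. □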

In this corollary, we can use the estimate

$$\varepsilon_{\mathrm{upper},n}(s) \le \inf_{\varepsilon > 0} \Big[\frac{1}{n} \ln N_{\mathrm{ub}}(\Gamma, \varepsilon) + \ln(1 + \varepsilon)\Big],$$

where $N_{\mathrm{ub}}(\Gamma, \varepsilon)$ is the upper-bracketing covering number of $\Gamma$ at scale $\varepsilon$. The
result implies that if the critical upper-bracketing radius $\varepsilon_{\mathrm{upper},n}$ is of the
same (or smaller) order as the critical prior-mass radius $\varepsilon_{\pi,n}$, then with large
probability, the standard Bayesian posterior distribution will concentrate in
a $D^{\mathrm{Re}}_\rho$-ball of size $\varepsilon_{\pi,n}$. In this case, the standard Bayesian posterior has
the same rate of convergence when compared with the generalized Bayesian
posterior with $\lambda > 1$. However, if $\varepsilon_{\mathrm{upper},n}$ is large, then the standard Bayesian
method may fail to concentrate in a small $D^{\mathrm{Re}}_\rho$-ball around the truth $q$, even
when the critical prior radius $\varepsilon_{\pi,n}$ is small. This can be easily seen from the
same counterexample used to illustrate the slow convergence of the standard
MDL.

Although the standard Bayesian posterior distribution may not concentrate
even when $\varepsilon_{\pi,n}$ is small, Theorem 3.2 implies that the Bayesian density
estimator $E_\pi \pi(\theta|X) p(\cdot|\theta)$ is close to $q$ in the sense of weak convergence.

The consistency theorem given in [2] also relies on the upper covering
number $N_{\mathrm{ub}}(\Gamma, \varepsilon)$. However, no convergence rate was established. Therefore
Corollary 5.2 in some sense can be regarded as a refinement of their analysis
using their covering definition. Other kinds of covering numbers (e.g.,
Hellinger covering) can also be used in convergence analysis of nonparametric
Bayesian methods. For example, some different definitions can be found
in [4] and [12].

The convergence analysis in [12] employed techniques from empirical
processes, which can possibly lead to suboptimal convergence rates when the
covering number grows relatively fast as the scale $\varepsilon \to 0$. We shall focus on
[4], which employed techniques from hypothesis testing in [6]. The resulting
convergence theorem from their analysis cannot be as simply stated as those
in this paper. Moreover, some of their conditions can be relaxed. Using
techniques of this paper, we can obtain the following result. The proof, which
requires two additional lemmas, is left to the Appendix.

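Theorem 5.3. Consider a partition of $\Gamma$ as the union of countably many
disjoint measurable sets $\Gamma_j$ ($j = 1, \ldots$). Then $\forall \rho \in (0, 1)$ and $\gamma \ge 1$,

$$E_X \sum_j \pi(\Gamma_j|X) \inf_{p \in \mathrm{co}(\Gamma_j)} D^{\mathrm{Re}}_\rho(q\|p) \le \frac{(\gamma-\rho) \ln \sum_j \pi(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)} - \gamma \ln \sum_j \pi(\Gamma_j) e^{-n \sup_{p \in \mathrm{co}(\Gamma_j)} D_{\mathrm{KL}}(q\|p)}}{\rho(1-\rho)n},$$

where $\mathrm{co}(\Gamma_j)$ is the convex hull of densities in $\Gamma_j$, $\pi(\Gamma_j) = \int_{\Gamma_j} d\pi(\theta)$ is
the prior probability of $\Gamma_j$ and
$\pi(\Gamma_j|X) = \int_{\Gamma_j} \prod_{i=1}^n p(X_i|\theta)\, d\pi(\theta) / \int_\Gamma \prod_{i=1}^n p(X_i|\theta)\, d\pi(\theta)$
is the Bayesian posterior probability of $\Gamma_j$.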

An immediate consequence of the above theorem is a result on the
concentration of Bayesian posterior distributions that refines some aspects of
the main result in [4]. It also complements the upper-bracketing radius-based
bound in Corollary 5.2. For simplicity, we only state a version for the
$\rho$-divergence so that the result is directly comparable to that of [4]. A similar
bound can be stated for Renyi entropy.

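Corollary 5.3. Let $\varepsilon_{\pi,n} = \inf\{\varepsilon : \varepsilon \ge -\frac{1}{n} \ln \pi(\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon\})\}$.
Given $\rho \in (0, 1)$, we assume that $\forall \varepsilon > 0$, $\{p \in \Gamma : D_\rho(q\|p) \ge \varepsilon\}$ can be
covered by the union of measurable sets $\Gamma^\varepsilon_j$ ($j = 1, \ldots$) such that
$\inf\{D_\rho(q\|p) : p \in \bigcup_j \mathrm{co}(\Gamma^\varepsilon_j)\} \ge \varepsilon/2$. For all $s \in [0, 1]$, let

$$\varepsilon_{\mathrm{conv},n}(s) = \sup\Big\{\varepsilon_0 : \varepsilon_0 < \frac{1}{n} \sup_{\varepsilon \ge \varepsilon_0} \inf_{\{\Gamma^\varepsilon_j\}} \ln\Big(\sum_j \pi(\Gamma^\varepsilon_j)^s + 2\Big)\Big\}$$

be the critical convex-cover radius. Now $\forall \gamma \ge 1$ let

$$\varepsilon_n = 2\gamma\varepsilon_{\pi,n} + (\gamma - \rho)\varepsilon_{\mathrm{conv},n}((\gamma-1)/(\gamma-\rho)).$$

For all $t \ge 0$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\pi\Big(\Big\{p \in \Gamma : D_\rho(q\|p) \ge \frac{4(\varepsilon_n + (2\gamma-1)t)}{\rho(1-\rho)\delta}\Big\} \,\Big|\, X\Big) \le \frac{1}{1 + e^{nt}}.$$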

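Proof. Let $\varepsilon_t = 4(\varepsilon_n + (2\gamma-1)t)/(\rho(1-\rho)\delta)$. Similarly to the proof
of Corollary 5.1, we define $\Gamma_1 = \{p \in \Gamma : D_\rho(q\|p) < \varepsilon_t\}$, $\Gamma_2 = \Gamma - \Gamma_1$. We let
$a = e^{-nt}$ and define $\pi'(\theta) = a\pi(\theta)C$ when $\theta \in \Gamma_1$ and $\pi'(\theta) = \pi(\theta)C$ when
$\theta \in \Gamma_2$, where the normalization constant $C = (a\pi(\Gamma_1) + \pi(\Gamma_2))^{-1} \in [1, 1/a]$.

Let $\Gamma'_0 = \{p \in \Gamma : D_{\mathrm{KL}}(q\|p) < \varepsilon_{\pi,n}\}$. Since $D_{\mathrm{KL}}(q\|p) = D_0(q\|p)$ and
$\varepsilon_t \ge \varepsilon_{\pi,n}/\min(\rho, 1-\rho)$, we know from Proposition 3.1 that $\Gamma'_0 \subset \Gamma_1$. Let
$\Gamma'_{-1} = \Gamma_1 - \Gamma'_0$. By assumption, it is clear that $\Gamma_2$ can be partitioned into
the union of disjoint measurable sets $\{\Gamma'_j\}$ ($j \ge 1$) such that $\Gamma'_j \subset \Gamma^{\varepsilon_t}_j$ and
$\inf_{p \in \bigcup_j \mathrm{co}(\Gamma'_j)} D_\rho(q\|p) \ge \varepsilon_t/2$. For this partition, we have

$$E_X \pi'(\Gamma_2|X)\,\varepsilon_t/2 \le E_X \sum_{j \ge -1} \pi'(\Gamma'_j|X) \inf_{p \in \mathrm{co}(\Gamma'_j)} D_\rho(q\|p).$$

Note that

$$\ln \sum_{j \ge -1} \pi'(\Gamma'_j)^{(\gamma-1)/(\gamma-\rho)} \le -\frac{\gamma-1}{\gamma-\rho} \ln a + \ln\Big(\sum_{j \ge 1} \pi(\Gamma'_j)^{(\gamma-1)/(\gamma-\rho)} + 2\Big) \le -\frac{\gamma-1}{\gamma-\rho} \ln a + n\varepsilon_{\mathrm{conv},n}\Big(\frac{\gamma-1}{\gamma-\rho}\Big)$$

and

$$-\ln \sum_{j \ge -1} \pi'(\Gamma'_j) e^{-n \sup_{p \in \mathrm{co}(\Gamma'_j)} D_{\mathrm{KL}}(q\|p)} \le n \sup_{p \in \mathrm{co}(\Gamma'_0)} D_{\mathrm{KL}}(q\|p) - \ln \pi'(\Gamma'_0) \le 2n\varepsilon_{\pi,n} + nt.$$

Combining the above estimates, and plugging them into Theorem 5.3, we
obtain

$$E_X \pi'(\Gamma_2|X) \le \frac{(\gamma-\rho)(-(\ln a)(\gamma-1)/(\gamma-\rho) + n\varepsilon_{\mathrm{conv},n}((\gamma-1)/(\gamma-\rho))) + \gamma(2n\varepsilon_{\pi,n} + nt)}{\rho(1-\rho)n\varepsilon_t/2} = 0.5\delta.$$

Therefore $\pi'(\Gamma_2|X) \le 0.5$ with probability $1 - \delta$. The desired bound for
$\pi(\Gamma_2|X)$ can be obtained from $\pi(\Gamma_2|X) = a\pi'(\Gamma_2|X)/(1 - (1-a)\pi'(\Gamma_2|X))$.
□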



If we can cover $\{p \in \Gamma : D_\rho(q\|p) \ge \varepsilon\}$ by $N_\varepsilon$ convex measurable sets $\Gamma^\varepsilon_j$
($j = 1, \ldots, N_\varepsilon$) such that $\inf\{D_\rho(q\|p) : p \in \bigcup_j \Gamma^\varepsilon_j\} \ge \varepsilon/2$, then we may take
$\gamma = 1$ in Corollary 5.3 with $\varepsilon_{\mathrm{conv},n}$ defined as

$$\varepsilon_{\mathrm{conv},n} = \sup\Big\{\varepsilon_0 : \varepsilon_0 < \frac{1}{n} \ln\Big(\sup_{\varepsilon \ge \varepsilon_0} N_\varepsilon + 2\Big)\Big\}.$$

Clearly if $\frac{1}{n} \ln N_\varepsilon = O(\varepsilon_{\pi,n})$ for some $\varepsilon = O(\varepsilon_{\pi,n})$, then with large
probability Bayesian posterior distributions concentrate on a $D_\rho$-ball of size
$O(\varepsilon_{\pi,n})$ around $q$. Note that this result relaxes a condition of [4], where the
KL-ball in our definition of $\varepsilon_{\pi,n}$ was replaced by the possibly smaller ball
$\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon, E_q(\ln(q/p))^2 \le \varepsilon\}$. Moreover, their covering definition
$N_\varepsilon$ does not apply to arbitrary convex covering sets directly (although it is
not difficult to modify their proof to deal with this case), and their result does
not directly handle noncompact families where $N_\varepsilon = \infty$ (which can be directly
handled by our result with $\gamma > 1$).

It is worth mentioning that for practical purposes, the balls
$\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon, E_q(\ln(q/p))^2 \le \varepsilon\}$ and $\{p \in \Gamma : D_{\mathrm{KL}}(q\|p) \le \varepsilon\}$ are
usually of comparable size. Therefore relaxing this condition may not always
lead to significant practical advantages. However, it is possible to construct
examples such that this refinement makes a difference. For example, consider
the discrete family $\Gamma = \{p_j\}$ ($j \ge 1$) with prior $\pi_j = 1/(j(j+1))$. Assume that
the truth $q(x)$ is the uniform distribution on $[0,1]$, and $p_j(x) = 2^{-2j}$ when
$x \in [0, j^{-2}/2]$ and $p_j(x) = (j^2 - 2^{-2j-1})/(j^2 - 0.5)$ otherwise. It is clear that
$E_q(\ln(q/p_j))^2 \ge 0.5 \ln 4$, while $\lim_{j \to \infty} D_{\mathrm{KL}}(q\|p_j) = 0$. Therefore the result in [4]
cannot be applied, while Corollary 5.3 implies that the posterior distribution
is consistent in this example.

Applications of convergence results similar to Corollary 5.2 and
Corollary 5.3 can be found in [4] and [12]. It is also useful to note that Corollary 5.1
requires fewer assumptions to achieve good convergence rates, implying that
generalized Bayesian methods are more stable than the standard Bayesian
method. This fact has also been observed in [15].

6. Discussion. This paper studies certain randomized (and deterministic)
density estimation methods which we call information complexity
minimization. We introduced a general KL-entropy based convergence analysis,
and demonstrated that this approach can lead to simplified and improved
convergence results for MDL and Bayesian posterior distributions.

An important observation from our study is that generalized information
complexity minimization methods with regularization parameter $\lambda > 1$ are
more robust than the corresponding standard methods with $\lambda = 1$. That
is, their convergence behavior is completely determined by the local prior
density around the true distribution measured by the model resolvability
$\inf_{w \in S} R_\lambda(w)$. For MDL, this quantity (index of resolvability) is well behaved
if we put a not too small prior mass at a density that is close to the truth
$q$. For the Bayesian posterior, this quantity (Bayesian resolvability) is well
behaved if we put a not too small prior mass in a small KL-ball around
$q$. We have also demonstrated through an example that the standard MDL
(and Bayesian posterior) does not have this desirable property. That is, even
if we can guess the true density by putting a relatively large prior mass at
the true density $q$, we may not be able to estimate $q$ very well as long as
there exists a bad (random) prior structure even at places very far from the
truth $q$.

Therefore, although the standard Bayesian method is "optimal" in a
certain averaging sense, its behavior is heavily dependent on the regularity of
the prior distribution globally. Intuitively, the standard Bayesian method can
put too much emphasis on the difficult part of the prior distribution, which
degrades the estimation quality in the easier part in which we are actually
more interested. Therefore even if one is able to guess the true distribution
by putting a large prior mass around its neighborhood, the Bayesian method
can still behave poorly if one accidentally makes bad choices elsewhere. This
implies that unless one completely understands the impact of the prior, it
is much safer to use a generalized Bayesian method with $\lambda > 1$.

APPENDIX 

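A.1. Lower bounds of $E_X R_{\lambda'}(\hat w_X)$. In order to apply Theorem 3.1, we
shall bound the quantity $E_X R_{\lambda'}(\hat w_X)$ from below.

Lemma A.1. For all $\lambda' \ge 1$,
$E_X R_{\lambda'}(\hat w_X) \ge -\frac{\lambda'}{n} \ln E_\pi \big[E_q (p(x|\theta)/q(x))^{1/\lambda'}\big]^n \ge 0$.

Proof. The convex duality in Proposition 2.1 with
$f(\theta) = -\frac{1}{\lambda'} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)}$ implies

$$R_{\lambda'}(\hat w_X) \ge -\frac{\lambda'}{n} \ln E_\pi \exp\Big(-\frac{1}{\lambda'} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)}\Big).$$

Now by taking expectation and using Jensen's inequality with the convex
function $\psi(x) = -\ln(x)$, we obtain

$$E_X R_{\lambda'}(\hat w_X) \ge -\frac{\lambda'}{n} \ln E_X E_\pi \exp\Big(-\frac{1}{\lambda'} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)}\Big) = -\frac{\lambda'}{n} \ln E_\pi \Big[E_q \Big(\frac{p(x|\theta)}{q(x)}\Big)^{1/\lambda'}\Big]^n \ge 0,$$

which proves the lemma. □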

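Lemma A.2. Consider an arbitrary cover $\{\Gamma_j\}$ of $\Gamma$. The following
inequality is valid $\forall \lambda' \in [0, 1]$:

$$E_X R_{\lambda'}(\hat w_X) \ge -\frac{1}{n} \ln \sum_j \pi(\Gamma_j)^{\lambda'} (1 + r_{\mathrm{ub}}(\Gamma_j))^n,$$

where $r_{\mathrm{ub}}$ is the upper-bracketing radius in Definition 3.2.

Proof. The proof is similar to that of Lemma A.1, but with a slightly
different estimate. We again start with the inequality

$$R_{\lambda'}(\hat w_X) \ge -\frac{\lambda'}{n} \ln E_\pi \exp\Big(-\frac{1}{\lambda'} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)}\Big).$$

Taking expectation and using Jensen's inequality with the convex function
$\psi(x) = -\ln(x)$, we obtain

$$\begin{aligned}
-E_X R_{\lambda'}(\hat w_X) &\le \frac{1}{n} \ln E_X \Big[E_\pi \exp\Big(-\frac{1}{\lambda'} \sum_{i=1}^n \ln \frac{q(X_i)}{p(X_i|\theta)}\Big)\Big]^{\lambda'} \\
&\le \frac{1}{n} \ln E_X \Big[\sum_j \pi(\Gamma_j) \exp\Big(-\frac{1}{\lambda'} \sum_{i=1}^n \ln \frac{q(X_i)}{\sup_{\theta \in \Gamma_j} p(X_i|\theta)}\Big)\Big]^{\lambda'} \\
&\le \frac{1}{n} \ln E_X \sum_j \pi(\Gamma_j)^{\lambda'} \exp\Big(-\sum_{i=1}^n \ln \frac{q(X_i)}{\sup_{\theta \in \Gamma_j} p(X_i|\theta)}\Big) \\
&= \frac{1}{n} \ln \sum_j \pi(\Gamma_j)^{\lambda'} (1 + r_{\mathrm{ub}}(\Gamma_j))^n.
\end{aligned}$$

The third inequality follows from the fact that $\forall \lambda' \in [0, 1]$ and positive
numbers $\{a_j\}$, $(\sum_j a_j)^{\lambda'} \le \sum_j a_j^{\lambda'}$. □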
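
The elementary inequality used in the last step, that a power between 0 and 1 of a sum of positive numbers is at most the sum of the powers, can be spot-checked numerically. The helper name and the sample values below are hypothetical and serve only as an illustration.

```python
def power_sum_gap(a, lam):
    """Return sum_j a_j^lam - (sum_j a_j)^lam, which is nonnegative for lam in [0, 1]."""
    return sum(x ** lam for x in a) - sum(a) ** lam

# Hypothetical positive weights and exponents in [0, 1].
weights = [0.5, 0.25, 0.125, 0.125]
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(lam, power_sum_gap(weights, lam))
```

The gap vanishes at the exponent 1 and is largest at 0, where the right-hand side simply counts the number of terms.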

A.2. Proof of Theorem 5.3. The proof requires two lemmas.

Lemma A.3. Consider a partition of $\Gamma$ as the union of countably many
disjoint measurable sets $\Gamma_j$ ($j = 1, \ldots$). Let

$$q(X) = \prod_{i=1}^n q(X_i), \qquad p_j(X) = \frac{1}{\pi(\Gamma_j)} \int_{\Gamma_j} \prod_{i=1}^n p(X_i|\theta)\, d\pi(\theta).$$

Then we have $\forall \rho \in (0, 1)$ and $\gamma \ge 1$,

$$E_X \sum_j \pi(\Gamma_j|X) D^{\mathrm{Re}}_\rho(q(X')\|p_j(X')) \le \frac{(\gamma-\rho) \ln \sum_j \pi(\Gamma_j)^{(\gamma-1)/(\gamma-\rho)} - \gamma \ln \sum_j \pi(\Gamma_j) e^{-D_{\mathrm{KL}}(q(X')\|p_j(X'))}}{\rho(1-\rho)},$$

where $X', X \in \mathcal{X}^n$, $q(X)$ is the density of $X$, and $p_j(X)$ is the mixture
density over $\Gamma_j$ under $\pi$.

Proof. We shall apply Corollary 3.3 with a slightly different
interpretation. Instead of considering $X$ as $n$ independent samples $X_i$ as before,
we simply regard it as one random variable by itself. Consider the family
$\Gamma'$ which consists of discrete densities $p_j(X)$, with prior $\pi_j = \pi(\Gamma_j)$. This
discretization itself can be regarded as a 0-upper discretization of $\Gamma'$. Also,
given $X$, it is easy to see that the Bayesian posterior on $\Gamma'$ with respect to
$\{\pi_j\}$ is $\hat\pi_j = \pi(\Gamma_j|X)$. We can thus apply Corollary 3.3 on $\Gamma'$, which leads
to the stated bound [with the help of (10)]. □

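In order to apply the above lemma, we also need to simplify
$D^{\mathrm{Re}}_\rho(q(X')\|p_j(X'))$ and $D_{\mathrm{KL}}(q(X')\|p_j(X'))$.

Lemma A.4. We have the bounds

$$\inf_{p \in \mathrm{co}(\Gamma_j)} D^{\mathrm{Re}}_\rho(q(X_1)\|p(X_1)) \le \frac{D^{\mathrm{Re}}_\rho(q(X)\|p_j(X))}{n} \le \sup_{p \in \mathrm{co}(\Gamma_j)} D^{\mathrm{Re}}_\rho(q(X_1)\|p(X_1))$$

and

$$\inf_{p \in \mathrm{co}(\Gamma_j)} D_{\mathrm{KL}}(q(X_1)\|p(X_1)) \le \frac{D_{\mathrm{KL}}(q(X)\|p_j(X))}{n} \le \sup_{p \in \mathrm{co}(\Gamma_j)} D_{\mathrm{KL}}(q(X_1)\|p(X_1)).$$

Proof. Since $D_{\mathrm{KL}}(q\|p) = \lim_{\rho \to 0+} D^{\mathrm{Re}}_\rho(q\|p)$, we only need to prove the
first two inequalities. The proof is essentially the same as that of Lemma 4
on page 478 of [6], which dealt with the existence of tests under the Hellinger
distance. We include it here for completeness.

We shall only prove the first half of the first two inequalities (the second
half has an identical proof) and we shall prove the claim by induction. If
$n = 1$, then since $p_j(X) \in \mathrm{co}(\Gamma_j)$ the claim holds trivially. Now assume that
the claim holds for $n = k$. For $n = k + 1$, if we let

$$w_j(\theta|X_1, \ldots, X_k) = \frac{\prod_{i=1}^k p(X_i|\theta)}{\int_{\Gamma_j} \prod_{i=1}^k p(X_i|\theta)\, d\pi(\theta)},$$

then

$$\begin{aligned}
&\exp(-\rho(1-\rho) D^{\mathrm{Re}}_\rho(q(X)\|p_j(X))) \\
&\quad= E_{X_1,\ldots,X_k} \Big[\Big(\frac{\frac{1}{\pi(\Gamma_j)} \int_{\Gamma_j} \prod_{i=1}^k p(X_i|\theta)\, d\pi(\theta)}{\prod_{i=1}^k q(X_i)}\Big)^\rho E_{X_{k+1}} \Big(\frac{\int_{\Gamma_j} w_j(\theta|X_1,\ldots,X_k) p(X_{k+1}|\theta)\, d\pi(\theta)}{q(X_{k+1})}\Big)^\rho\Big] \\
&\quad\le E_{X_1,\ldots,X_k} \Big(\frac{\frac{1}{\pi(\Gamma_j)} \int_{\Gamma_j} \prod_{i=1}^k p(X_i|\theta)\, d\pi(\theta)}{\prod_{i=1}^k q(X_i)}\Big)^\rho \times \sup_{p \in \mathrm{co}(\Gamma_j)} E_{X_{k+1}} \Big(\frac{p(X_{k+1})}{q(X_{k+1})}\Big)^\rho \\
&\quad= E_{X_1,\ldots,X_k} \Big(\frac{\frac{1}{\pi(\Gamma_j)} \int_{\Gamma_j} \prod_{i=1}^k p(X_i|\theta)\, d\pi(\theta)}{\prod_{i=1}^k q(X_i)}\Big)^\rho \times \sup_{p \in \mathrm{co}(\Gamma_j)} e^{-\rho(1-\rho) D^{\mathrm{Re}}_\rho(q(X_{k+1})\|p(X_{k+1}))} \\
&\quad\le e^{-\rho(1-\rho) k \inf_{p \in \mathrm{co}(\Gamma_j)} D^{\mathrm{Re}}_\rho(q(X_1)\|p(X_1))} \cdot e^{-\rho(1-\rho) \inf_{p \in \mathrm{co}(\Gamma_j)} D^{\mathrm{Re}}_\rho(q(X_{k+1})\|p(X_{k+1}))} \\
&\quad= \exp\Big(-\rho(1-\rho)\, n \inf_{p \in \mathrm{co}(\Gamma_j)} D^{\mathrm{Re}}_\rho(q(X_1)\|p(X_1))\Big).
\end{aligned}$$

This proves the claim for $n = k + 1$. Note that in the above derivation, the
first of the two inequalities follows from the fact that with fixed $X_1, \ldots, X_k$,
the density $\int_{\Gamma_j} w_j(\theta|X_1, \ldots, X_k) p(X_{k+1}|\theta)\, d\pi(\theta) \in \mathrm{co}(\Gamma_j)$; the
second of the two inequalities follows from the induction hypothesis. □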

Proof of Theorem 5.3. We simply substitute the estimates of
Lemma A.4 into Lemma A.3. □